### Exploratory Data Analysis for news consumption (Part 2)
In this second part of our exploratory data analysis, we want to find out more about the articles, so let's start by loading the data:

In [None]:
import pandas as pd
import numpy as np
import scipy.sparse as sp

import plotly.express as px
import plotly.graph_objects as go

from progressbar import ProgressBar

In [None]:
news = pd.read_csv("../data/MINDlarge_train/news_processed.csv")
behaviors_by_date = pd.read_csv("../data/MINDlarge_train/behaviors_by_date_large.csv")
category_weekday_df = pd.read_csv("../data/MINDlarge_train/category_weekday_df.csv")

In [None]:
behaviors_by_date.history = behaviors_by_date.history.str.split(' ')
behaviors_by_date.impressions = behaviors_by_date.impressions.str.split(' ')

In [None]:
behaviors_by_date_np = behaviors_by_date.to_numpy()

We know which articles were recommended in a given session and which of them were clicked (although we don't know on what basis the recommendations were made). So first of all, we can **identify the most clicked articles**. To do that, we count the clicked article IDs, sort them from highest to lowest click count, and map the IDs to their titles and categories:

In [None]:
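clicked = []
for row in behaviors_by_date_np:
    # Column 4 holds the list of impressions for the session.
    for impression in row[4]:
        # A trailing '1' marks a clicked article; the last two characters are dropped to keep the ID.
        if impression[-1] == '1':
            clicked.append(impression[:-2])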
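
To make the counting step easier to follow, here is a minimal sketch on toy data. The article IDs and sessions are made up, and we assume each impression has the form `<article_id>-<label>`, with label `1` for clicked and `0` for not clicked. Splitting on the last hyphen is just one alternative to slicing off the last two characters:

```python
from collections import Counter

# Toy impression logs (hypothetical IDs); each entry is "<article_id>-<label>",
# where label "1" means clicked and "0" means shown but not clicked.
toy_impressions = [
    ["N1-1", "N2-0", "N3-0"],
    ["N1-1", "N3-1"],
    ["N2-0", "N1-0"],
]

click_counts = Counter()
for session in toy_impressions:
    for impression in session:
        # Split on the last hyphen instead of slicing fixed positions.
        article_id, label = impression.rsplit("-", 1)
        if label == "1":
            click_counts[article_id] += 1

# Articles ordered from most to least clicked.
top_clicked = click_counts.most_common()
print(top_clicked)
```

Articles that were shown but never clicked simply do not appear in the counter.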

In [None]:
clicked[:10]

In [None]:
clicked_df = pd.DataFrame(clicked)

In [None]:
clicked_df.shape

In [None]:
most_clicked = clicked_df.iloc[:, 0].value_counts()

In [None]:
most_clicked_df = pd.DataFrame(most_clicked)

In [None]:
most_clicked_df.reset_index(inplace=True)
most_clicked_df.columns = ['article_id', 'clicks']

In [None]:
most_clicked_df

In [None]:
most_clicked_df.iloc[1,1]

In [None]:
news_np = news.to_numpy()

In [None]:
article_ids = []
titles = []
for row in news_np:
    article_ids.append(row[0])
    titles.append(row[3])
    
title_dict = dict(zip(article_ids, titles))        

In [None]:
keys = []
values = []
for row in news_np:
    keys.append(row[0]) 
    values.append(row[1])

category_dict = dict(zip(keys, values))
    

In [None]:
for i, article_id in enumerate(most_clicked_df.article_id[:5]):
    print(title_dict[article_id])
    print(most_clicked_df.iloc[i, 1])
    print(category_dict[article_id]) 
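
As an aside, the same lookup can be sketched with a join instead of dictionaries. The toy frames and the column names `article_id`, `category` and `title` below are assumptions for illustration only; the real `news` columns may be named differently:

```python
import pandas as pd

# Toy stand-ins; the column names 'article_id', 'category' and 'title' are assumptions.
toy_news = pd.DataFrame({
    "article_id": ["N1", "N2", "N3"],
    "category": ["sports", "news", "lifestyle"],
    "title": ["Title A", "Title B", "Title C"],
})
toy_clicks = pd.DataFrame({"article_id": ["N3", "N1"], "clicks": [7, 4]})

# A left join keeps the order of the click ranking and attaches title and category.
top_articles = toy_clicks.merge(toy_news, on="article_id", how="left")
print(top_articles[["title", "clicks", "category"]])
```

Whether a join or a dictionary is more convenient depends on what we do with the result afterwards; for printing a handful of titles, both work equally well.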

Okay, so this is obviously not earth-shattering news, but it's what people seem to be interested in.
With the **article in 5th place getting only 60% of the clicks of the top article**, let's look at **how clicks are distributed** in general:

In [None]:
article_clicks = px.histogram(
    most_clicked_df, x='clicks', nbins=250,
    color_discrete_sequence=['lime'],
    marginal='rug',
    labels={'clicks': 'Number of Clicks'},
)
# Transparent backgrounds and a white font for the figure.
article_clicks.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
    'font_color': 'white',
})


Oh wow! We can clearly see that there are **many articles with relatively few clicks and only a few articles with a lot of clicks**. Let's now check which were the **most read articles in the user histories**:

In [None]:
articles_in_history = []
for row in behaviors_by_date_np:
    for article_id in row[3]:
        articles_in_history.append(article_id)

In [None]:
len(articles_in_history)

In [None]:
articles_in_history[:5]

In [None]:
from collections import Counter

In [None]:
articles_in_history_count = Counter(articles_in_history)

In [None]:
articles_in_history_count['N59850']

In [None]:
# most_common() returns (article_id, count) pairs sorted from highest to lowest count.
articles_in_history_count = articles_in_history_count.most_common()

In [None]:
for pair in articles_in_history_count[:5]:
    print(title_dict[pair[0]])
    print(pair[1])
    print(category_dict[pair[0]])   

Interestingly, we now also find an article that can be considered political content. Unfortunately, we also find a **cleansing artifact**: when remapping the redundant article IDs, we also collapsed the IDs of a daily cartoon, because it has the same title every day! This won't be much of a problem for our models, but we clearly **need to dethrone this impostor**, so we skip the top entry:

In [None]:
for pair in articles_in_history_count[1:6]:
    print(title_dict[pair[0]])
    print(pair[1])
    print(category_dict[pair[0]])

As we did for the clicked articles, let's check the **distribution of read articles in histories**:

In [None]:
articles_read = []
count=[]
for pair in articles_in_history_count:
    articles_read.append(pair[0])
    count.append(pair[1])

In [None]:
articles_read_df = pd.DataFrame(list(zip(articles_read, count)), columns = ['article_id', 'count'])

In [None]:
articles_read_df.head()

In [None]:
articles_read = px.histogram(
    articles_read_df, x='count', nbins=100,
    color_discrete_sequence=['deeppink'],
    marginal='rug',
    labels={'count': 'Number of Clicks'},
)
# Transparent backgrounds and a white font for the figure.
articles_read.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
    'font_color': 'white',
})


Okay, so here we have pretty much the **same situation as with the clicked articles**: a few very frequently read articles vs. a whole lot of less frequently read ones.
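
If we want to put a number on this kind of skew rather than only eyeballing the histogram, one possible sketch looks like this. The counts are made up and the 20% cut-off is an arbitrary choice for illustration; neither comes from the dataset:

```python
import pandas as pd

# Toy counts per article, sorted from most to least frequent (made-up numbers).
toy_counts = pd.Series([100, 80, 70, 65, 60, 5, 5, 5, 5, 5])

# Count of the 5th-ranked article relative to the top article.
fifth_vs_top = toy_counts.iloc[4] / toy_counts.iloc[0]

# Share of the total that goes to the top 20% of articles (arbitrary cut-off).
top_n = max(1, len(toy_counts) // 5)
top_share = toy_counts.iloc[:top_n].sum() / toy_counts.sum()

print(fifth_vs_top, round(top_share, 2))
```

The same two quantities could be computed on `most_clicked_df.clicks` or on the `count` column of `articles_read_df`, since both are already sorted in descending order.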

Let's now check the **ratio of clicked articles to the total number of suggested articles in each session**:

In [None]:
behaviors_by_date['impression_count'] = behaviors_by_date.impressions

In [None]:
behaviors_by_date['impression_count'] = behaviors_by_date.impression_count.map(len)

In [None]:
behaviors_by_date.head()

In [None]:
clicked, non_clicked = [], []

for row in behaviors_by_date_np:
    clicked_per_session = []
    non_clicked_per_session = []
    # Column 4 holds the impressions; the last character is the click label.
    for article_id in row[4]:
        if article_id[-1] == '1':
            clicked_per_session.append(article_id[:-2])
        elif article_id[-1] == '0':
            non_clicked_per_session.append(article_id[:-2])
    clicked.append(clicked_per_session)
    non_clicked.append(non_clicked_per_session)

In [None]:
behaviors_by_date['clicked'] = pd.Series(clicked)

In [None]:
behaviors_by_date['non_clicked'] = pd.Series(non_clicked)

In [None]:
behaviors_by_date.head()

In [None]:
behaviors_by_date['click_length'] =   behaviors_by_date.clicked.map(len)
behaviors_by_date['non_click_length'] = behaviors_by_date.non_clicked.map(len)

In [None]:
behaviors_by_date_np_2 = behaviors_by_date.to_numpy()

In [None]:
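# Share of suggested articles that were clicked in each session.
# Columns are addressed by name so the result does not depend on column order.
ratios = (
    behaviors_by_date.click_length / behaviors_by_date.impression_count
).round(2).tolist()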

In [None]:
behaviors_by_date['click_ratio'] = pd.Series(ratios)

In [None]:
behaviors_by_date.click_ratio.describe()
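
To see what this ratio means on a small scale, here is a hedged sketch on two toy sessions. The IDs are made up; the column names mirror the ones we use above, but the frame itself is purely illustrative:

```python
import pandas as pd

# Toy sessions with made-up IDs, already split into lists of "<article_id>-<label>".
toy_sessions = pd.DataFrame({
    "impressions": [["N1-1", "N2-0", "N3-0", "N4-0"], ["N5-1", "N6-1"]],
})

# Count the entries ending in "-1" (clicked) per session.
toy_sessions["click_length"] = toy_sessions.impressions.map(
    lambda items: sum(item.endswith("-1") for item in items)
)
# Number of suggested articles per session.
toy_sessions["impression_count"] = toy_sessions.impressions.map(len)
# Clicked suggestions divided by all suggestions, rounded to two decimals.
toy_sessions["click_ratio"] = (
    toy_sessions.click_length / toy_sessions.impression_count
).round(2)

print(toy_sessions[["click_length", "impression_count", "click_ratio"]])
```

A session where every suggestion was clicked gets a ratio of 1, and a session with one click out of four suggestions gets 0.25.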

In [None]:
behaviors_by_date.head(3)

In [None]:
behaviors_by_date.to_csv("../data/MINDlarge_train/beahviors_by_date_clicks.csv", index=False)

In [None]:
behaviors_by_date = pd.read_csv('../data/MINDlarge_train/beahviors_by_date_clicks.csv')

In [None]:
click_ratios = px.histogram(behaviors_by_date, x='click_ratio', 
                            labels={'click_ratio': 'Proportion of Clicked Suggestions'}, color_discrete_sequence=['aqua']
                            )


click_ratios.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
'font_color' : 'white'
})

As we can see, **in most sessions fewer than one in ten suggested articles was clicked**. There also seem to be some special circumstances under which every second article was clicked, although we cannot (and fortunately don't need to) reconstruct these today. **This concludes our exploratory data analysis**. See you in the next notebook, where we will build our first, more conventional recommender systems!