### Exploratory Data Analysis for news consumption (Part 1)
In this notebook, we will do some exploratory data analysis. Since our user data is completely anonymized, this part of our project plays a rather minor role. Still, we want to **find some insights into online news consumption with the available data**. So let's start by loading the data:

In [None]:
import pandas as pd
import numpy as np
import scipy.sparse as sp

import plotly.express as px
import plotly.graph_objects as go

from progressbar import ProgressBar

In [None]:
behaviors = pd.read_csv("../data/MINDlarge_train/behaviors_processed.csv")
news = pd.read_csv("../data/MINDlarge_train/news_processed.csv")

In [None]:
behaviors.head()

First, we will check whether there is some **weekday-specific click behavior**. For this task, we transform the times of the online sessions to the **pandas datetime format**, so that we can also extract the specific weekday:

In [None]:
behaviors['time'] = pd.to_datetime(behaviors['time'])
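
A small aside on the conversion: `pd.to_datetime` infers the format when none is given. The sketch below assumes timestamps in the style of the raw MIND logs (`11/11/2019 9:05:58 AM`); that our processed file still uses this format is an assumption, and the toy series `raw_times` is made up. Passing an explicit `format` together with `errors="coerce"` turns unparseable entries into `NaT` instead of raising an error:

```python
import pandas as pd

# Made-up timestamps in the assumed month/day/year, 12-hour clock style.
raw_times = pd.Series(["11/11/2019 9:05:58 AM", "11/15/2019 8:55:22 PM", "not a time"])

# An explicit format avoids day/month ambiguity; errors="coerce" yields NaT for bad rows.
parsed = pd.to_datetime(raw_times, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")

# Rows that failed to parse can then be counted before moving on.
n_unparsed = int(parsed.isna().sum())
```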

In [None]:
behaviors.head(3)

In [None]:
behaviors_by_date = behaviors.sort_values(by=['time']).copy()

In [None]:
behaviors_by_date['date'] = behaviors_by_date['time'].dt.date

In [None]:
behaviors_by_date['weekday'] = behaviors_by_date['time'].dt.weekday

In [None]:
behaviors_by_date.weekday = behaviors_by_date.weekday.replace(
    [0, 1, 2, 3, 4, 5, 6],
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
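
As an aside, pandas can also produce the weekday names directly via `Series.dt.day_name()`. A minimal sketch on a few made-up timestamps (the toy frame `demo` is not part of our data):

```python
import pandas as pd

# Toy timestamps (made up for illustration); 2019-11-11 was a Monday.
demo = pd.DataFrame({"time": pd.to_datetime(["2019-11-11 08:00:00",
                                             "2019-11-16 21:30:00",
                                             "2019-11-17 10:15:00"])})

# dt.weekday counts from 0 (Monday) to 6 (Sunday) ...
demo["weekday_number"] = demo["time"].dt.weekday
# ... while dt.day_name() returns the English weekday names directly.
demo["weekday"] = demo["time"].dt.day_name()
```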

In [None]:
behaviors_by_date.head(3)

In [None]:
behaviors_by_date.tail(3)

Now that we have this extra information, let's **save this expanded dataframe to a .csv file**:

In [None]:
behaviors_by_date.to_csv("../data/MINDlarge_train/behaviors_by_date_large.csv", index=False)

Given that extracting the weekdays worked, let's check the **general click behavior for our time span**:

In [None]:
hist_date = px.histogram(behaviors_by_date.weekday)

In [None]:
hist_date.show()
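
One thing to keep in mind when reading such a plot: without further hints, the bars typically appear in the order in which the weekday names first occur in the data. Here is a minimal sketch on made-up values (the names `weekday_order` and `demo_weekdays` are ours, for illustration only) that counts the rows per weekday and puts them into calendar order:

```python
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Made-up weekday labels, one per session.
demo_weekdays = pd.Series(["Monday", "Sunday", "Monday", "Friday"], name="weekday")

# Count rows per weekday, then reorder to Monday..Sunday; missing days get a count of 0.
counts = demo_weekdays.value_counts().reindex(weekday_order, fill_value=0)
```

The same list can be handed to `px.histogram` through its `category_orders` argument, e.g. `category_orders={'weekday': weekday_order}`.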

Apparently, MSN is generating **roughly twice as many clicks on working days as on weekend days**. Let's now check whether there are **dynamics in the composition of clicked categories**. In order to do this, we need to map the articles' IDs to their categories:

In [None]:
behaviors_by_date.history = behaviors_by_date.history.str.split(' ')
behaviors_by_date.impressions = behaviors_by_date.impressions.str.split(' ')

In [None]:
behaviors_by_date_np = behaviors_by_date.to_numpy()

In [None]:
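clicked = []
for row in behaviors_by_date_np:
    clicked_per_user = []
    # each impression is a string ending in '-1' (clicked) or '-0' (not clicked)
    for impression in row[4]:
        if impression[-1] == '1':
            # strip the two-character '-1' suffix to keep only the article ID
            clicked_per_user.append(impression[:-2])
    clicked.append(clicked_per_user)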
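
The loop above can also be expressed with pandas string methods. The following is only a sketch on two made-up impression strings (the frame `demo` and the column names `news_id` and `clicked` are hypothetical), assuming each impression has the form `<news_id>-<0 or 1>`:

```python
import pandas as pd

# Two made-up sessions with space-separated impressions.
demo = pd.DataFrame({"impressions": ["N1-1 N2-0 N3-1", "N4-0 N5-1"]})

# One row per impression; the original row index is kept as the session identifier.
pairs = demo["impressions"].str.split(" ").explode()

# Split each impression at its last '-' into article ID and click flag.
parts = pairs.str.rsplit("-", n=1, expand=True)
parts.columns = ["news_id", "clicked"]

# Keep only clicked articles and collect them per session again.
clicked_ids = parts.loc[parts["clicked"] == "1", "news_id"]
clicked_per_session = clicked_ids.groupby(level=0).agg(list)
```

Note that in this variant a session without any click would simply not appear in `clicked_per_session`, whereas the loop keeps an empty list for it.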

In [None]:
clicked[:10]

In [None]:
news.head(3)

In [None]:
news_np = news.to_numpy()

In [None]:
# map each news ID (first column) to its category (second column)
category_dict = {row[0]: row[1] for row in news_np}

In [None]:
clicked_categories = []
for clicks in clicked:
    clicks_per_session=[]
    for click in clicks:
        clicks_per_session.append(category_dict[click])
    clicked_categories.append(clicks_per_session)
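
A direct lookup like `category_dict[click]` raises a `KeyError` as soon as a clicked ID is missing from the news table. A minimal sketch of a more forgiving lookup, using made-up IDs and hypothetical column names (`news_id`, `category`) and the placeholder label `'unknown'`, which is our own choice:

```python
import pandas as pd

# Made-up news table; the column names are assumptions for this example.
demo_news = pd.DataFrame({"news_id": ["N1", "N2"], "category": ["sports", "news"]})
category_lookup = dict(zip(demo_news["news_id"], demo_news["category"]))

# Made-up clicks per session; 'N99' does not exist in the news table.
clicked_demo = [["N1", "N2"], ["N2", "N99"]]

# dict.get returns the fallback label instead of raising a KeyError.
clicked_categories_demo = [
    [category_lookup.get(news_id, "unknown") for news_id in clicks]
    for clicks in clicked_demo
]
```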


In [None]:
clicked_categories[:5]

In [None]:
weekdays = []
for row in behaviors_by_date_np:
    weekdays.append(row[7])

In [None]:
len(weekdays), len(clicked_categories)

In [None]:
category = []
weekday = []
# repeat each session's weekday once per clicked category
for i, cats in enumerate(clicked_categories):
    for cat in cats:
        category.append(cat)
        weekday.append(weekdays[i])

In [None]:
len(category), len(weekday)

In [None]:
category_weekday_df = pd.DataFrame(list(zip(category, weekday)),
                                  columns= ['category', 'weekday'])
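
The same long format (one row per clicked category) can also be reached with `DataFrame.explode`. This is just a sketch on made-up sessions, not the code we actually ran; the frame `demo_sessions` is hypothetical:

```python
import pandas as pd

# Made-up sessions: one weekday and a list of clicked categories each.
demo_sessions = pd.DataFrame({
    "weekday": ["Monday", "Sunday", "Friday"],
    "category": [["sports", "news"], ["tv"], []],
})

# explode repeats the weekday for every list element; an empty list becomes NaN,
# so we drop those rows afterwards.
long_df = (demo_sessions.explode("category")
                        .dropna(subset=["category"])
                        .reset_index(drop=True))
```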

Now that we have a **dataframe that combines every single clicked category with a weekday**, let's save it to a .csv file:

In [None]:
category_weekday_df.to_csv("../data/MINDlarge_train/category_weekday_df.csv", index=False)

In [None]:
#news = pd.read_csv("../data/MINDlarge_train/news_processed.csv")
#behaviors_by_date = pd.read_csv("../data/MINDlarge_train/behaviors_by_date_large.csv")
#category_weekday_df = pd.read_csv("../data/MINDlarge_train/category_weekday_df.csv")

We can now plot the **weekdays with the respective proportions of clicked categories**:

In [None]:
hist_clicks_cats = px.histogram(category_weekday_df, x= 'weekday',color= 'category',
                                color_discrete_sequence=px.colors.cyclical.HSV, 
                                labels={'category': 'Categories', 'weekday': 'Weekdays', 'count': 'Number of Clicks'})
hist_clicks_cats.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
'font_color' : 'white'
})

The labeling is intentionally set to white so that we can use the plot on the dark background of our presentation. But there are **too many small categories**! Let's **remap the smaller ones to more general categories**:

In [None]:
category_weekday_df.category.value_counts()

In [None]:
news.category.value_counts()

In [None]:
rename_dict = {'news': 'News', 'sports': 'Sports', 'lifestyle': 'Lifestyle', 'foodanddrink': 'Lifestyle',
               'health': 'Lifestyle', 'finance': 'Finance', 'entertainment': 'Entertainment',
               'music': 'Entertainment', 'tv': 'Entertainment', 'video': 'Entertainment',
               'movies': 'Entertainment', 'travel': 'Travel', 'kids': 'Other', 'northamerica': 'Other',
               'middleeast': 'Other', 'games': 'Other', 'autos': 'Other', 'weather': 'Weather'}

In [None]:
rename_dict['foodanddrink']

In [None]:
category_weekday_df['unified'] = [rename_dict[x] for x in category_weekday_df.category]
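
An alternative to the list comprehension is `Series.map`, which returns `NaN` for categories missing from the dictionary instead of raising a `KeyError`. A minimal sketch with a shortened, made-up mapping (`demo_mapping`) and the invented category `'brandnew'`; sending unseen categories to `'Other'` is our assumption here:

```python
import pandas as pd

# Shortened mapping, for illustration only.
demo_mapping = {"sports": "Sports", "tv": "Entertainment", "foodanddrink": "Lifestyle"}

# 'brandnew' is a made-up category that the mapping does not know.
demo_categories = pd.Series(["sports", "tv", "foodanddrink", "brandnew"])

# map leaves unknown keys as NaN; fillna then assigns them to a catch-all group.
unified_demo = demo_categories.map(demo_mapping).fillna("Other")
```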

In [None]:
category_weekday_df.head()

In [None]:
news['unified'] = [rename_dict[x] for x in news.category]

In [None]:
news.head(3)

In [None]:
hist_clicks_unified = px.histogram(category_weekday_df, x= 'weekday',color= 'unified',
                                color_discrete_sequence=px.colors.cyclical.HSV, 
                                labels={'unified': 'Categories', 'weekday': 'Weekdays', 'count': 'Number of Clicks'})
hist_clicks_unified.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
'font_color' : 'white'
})

With the more general categories, we have a much clearer view of how they are distributed over the weekdays. The only noticeable difference, though, is that **sports articles seem to be clicked more frequently on working days**. This could be because sports events *happen* on the weekend, whereas their coverage and reports on surrounding events are published during the week.
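
Because working days have more clicks overall, absolute counts can hide or exaggerate such differences. One way to check an impression like this is to compare the category *shares* within each weekday. A minimal sketch with `pd.crosstab` on made-up clicks (the frame `demo_clicks` is hypothetical):

```python
import pandas as pd

# Made-up clicks: four on Monday, four on Sunday.
demo_clicks = pd.DataFrame({
    "weekday": ["Monday"] * 4 + ["Sunday"] * 4,
    "unified": ["Sports", "Sports", "News", "News",
                "Sports", "News", "News", "News"],
})

# normalize="index" divides every row by its total, so each weekday sums to 1.
shares_by_weekday = pd.crosstab(demo_clicks["weekday"], demo_clicks["unified"],
                                normalize="index")
```

In this toy example, sports makes up half of Monday's clicks but only a quarter of Sunday's.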

Unfortunately, we can't say at what times specific articles are available, but we can **compare all the available articles and their categories in our *news* dataset to the clicked categories**. In order to do this, let's compute the respective proportions:

In [None]:
# name the count column explicitly; pandas >= 2.0 would otherwise call it 'count'
news_values = news.unified.value_counts().to_frame(name='unified')

In [None]:
news_values['share'] = [x/news.shape[0] for x in news_values.unified]

In [None]:
news_values 

In [None]:
# name the count column explicitly; pandas >= 2.0 would otherwise call it 'count'
clicks_values = category_weekday_df.unified.value_counts().to_frame(name='unified')

In [None]:
clicks_values['share_clicks'] = [x/category_weekday_df.shape[0] for x in clicks_values.unified]

In [None]:
clicks_values = clicks_values.sort_index()
news_values = news_values.sort_index()

In [None]:
clicks_values['cat'] = clicks_values.index
news_values['cat'] = news_values.index

In [None]:
news_values['share_clicks'] = clicks_values['share_clicks']
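
The steps above (counting, dividing by the total, aligning both tables on the category) can be condensed with `value_counts(normalize=True)` and `pd.concat`. This is only a sketch on made-up categories; `demo_articles`, `demo_clicked` and `comparison_demo` are hypothetical names:

```python
import pandas as pd

# Made-up category labels: 4 available articles and 8 clicks.
demo_articles = pd.Series(["Sports", "Sports", "News", "Lifestyle"])
demo_clicked = pd.Series(["News", "Lifestyle", "Lifestyle", "Sports",
                          "News", "Lifestyle", "News", "Lifestyle"])

# normalize=True returns shares instead of counts; concat aligns both on the category index.
comparison_demo = pd.concat(
    {"share": demo_articles.value_counts(normalize=True),
     "share_clicks": demo_clicked.value_counts(normalize=True)},
    axis=1,
).sort_index()
```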

In [None]:
news_values

In [None]:
comparison = go.Figure(data=[
    go.Bar(name='Proportion of Category in All Articles', x=news_values.cat, y=news_values.share, marker_color=px.colors.qualitative.Alphabet[25] ),
    go.Bar(name='Proportion of Category in Clicked Articles', x=news_values.cat, y=news_values.share_clicks, marker_color=px.colors.qualitative.Alphabet[6])
])
comparison.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
'font_color' : 'white'
})

As we can see, a couple of categories are clearly over-represented in the clicking behavior. Whereas proper news articles are clicked roughly in proportion to their general occurrence, **sports articles are clicked much less often** than their availability would suggest. The opposite holds for **entertainment and lifestyle articles** (with finance showing the same tendency), which are **clicked disproportionately often**. This information could potentially be used when fine-tuning recommender systems (in terms of content, not hyperparameters).
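
To put a single number on over- and under-representation, one option is the ratio of click share to article share per category. The sketch below uses made-up shares, not our actual results, and the column name `click_lift` is our own:

```python
import pandas as pd

# Made-up shares per category (each column sums to 1).
demo_shares = pd.DataFrame(
    {"share": [0.5, 0.25, 0.25], "share_clicks": [0.25, 0.5, 0.25]},
    index=["Sports", "Lifestyle", "News"],
)

# A ratio above 1 means the category is clicked more often than its availability suggests,
# below 1 means it is clicked less often.
demo_shares["click_lift"] = demo_shares["share_clicks"] / demo_shares["share"]
```

In this toy table, sports ends up with a ratio of 0.5 (under-represented in clicks), lifestyle with 2.0 (over-represented) and news with 1.0 (proportional).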

Our exploratory data analysis will continue in the second notebook!