-----
# Code Information:
<br>

|     |     |
| ----- | ----- |
| __Script:__ | MIND_Article_Recommender.ipynb |
| __Version:__ | Python 3.7.6 |
| __Author:__ | Matthew Wight |
| __Email:__ | wight_matthew@bah.com |
| __Published:__ | 11 March 2021 |


# Code Description
![title](data/MIND/MIND.png)

The MIcrosoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of Microsoft News website. The mission of MIND is to serve as a benchmark dataset for news recommendation and facilitate the research in news recommendation and recommender systems area. Link: https://msnews.github.io/

MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression. To protect user privacy, each user was de-linked from the production system when securely hashed into an anonymized ID. 

This code is designed to use Natural Language Processing (NLP) to identify a list of similar articles based on a sample article title submitted by the user.

-----
# 1.0 - Setup Environment

In [1]:
print("____________________________________________________________________________")
print("1.0 - Setup Environment: \n")

"""
DESCRIPTION:
This section installs and imports required Python packages. These are the versions 
of the packages used at time of build. Installs should only be used if your system 
has not been previously set up. Consult the script author or your local Python 
specialist for initial system configuration.

INSTRUCTIONS:
- Add/Remove packages as needed to further customize script for your needs.
- Verify versions of packages needed.

"""

#------------------------------------------------------------------------------------
# Step 1.1 - Start timing metrics

import time
code_start = time.time()
cell_start = time.time()


#------------------------------------------------------------------------------------
# Step 1.2 - Install packages

print('  ↓ Installing packages...')

""" 
To install packages uncomment out line with (Ctl + /) and run cell. 
Apply --user flag if necessary to install packages.

"""
# !pip install pathlib pandas numpy matplotlib seaborn warnings plotly 
# !pip install altair nltk wordcloud sklearn lightgbm xgboost eli5 pprint


#------------------------------------------------------------------------------------
# Step 1.3 - Import packages

print('  ↓ Importing packages...')

# Helper packages.
import sys
import os
import pathlib
from pathlib import Path
import copy
import re
from re import sub
import random
import pandas as pd
from pandas.core.common import SettingWithCopyWarning
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)
warnings.filterwarnings("ignore",category=DeprecationWarning)
warnings.filterwarnings("ignore",category=SettingWithCopyWarning)

# Imaging, visualization, and graphic tools.
import plotly
import plotly.express as px
import plotly.graph_objects as go
plotly.offline.init_notebook_mode(connected=True)
from IPython.display import clear_output
import ipywidgets as widgets
from IPython.display import display, HTML
from PIL import Image
import altair as alt

# Packages with tools for text processing.
import nltk
import nltk.data
# nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer 
from wordcloud import WordCloud

# Below libraries are for feature representation using sklearn.
from sklearn import metrics
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split
import sklearn.exceptions
warnings.filterwarnings("ignore", category=sklearn.exceptions.UndefinedMetricWarning)

# Packages for working with text data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Below libraries are for similarity matrices using sklearn.
from sklearn import metrics
from sklearn.metrics import precision_score, f1_score, classification_report
from sklearn.metrics.pairwise import cosine_similarity  
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import train_test_split

# Scikit-learn classifiers.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Tools to debug machine learning classifiers and explain their predictions.
import eli5
from eli5.lime import TextExplainer
from eli5.lime.samplers import MaskingTextSampler

# Packages for getting data ready for and building a LDA model.
import gensim
from gensim import corpora, models
from pprint import pprint


print('  ✓ SUCCESS: Packages imported! ')
print('--------------------------------------------- \n')

#------------------------------------------------------------------------------------
# Display environment settings

print("   Current Directory:    ", os.getcwd())
print("   Python Executable:    ", sys.executable)
print("   Python Version:       ", sys.version.replace('\n',''))
print("   Python Version Info:  ", sys.version_info, "\n")


#------------------------------------------------------------------------------------
# Button toggle

javascript_functions = {False: "hide()", True: "show()"}
button_descriptions  = {False: "Show code", True: "Hide code"}

def toggle_code(state):
    """Toggles the JavaScript show()/hide() function on the div.input element."""
    
    output_string = "<script>$(\"div.input\").{}</script>"
    output_args   = (javascript_functions[state],)
    output        = output_string.format(*output_args)
    display(HTML(output))

def button_action(value):
    """Calls the toggle_code function and updates the button description."""
    state = value.new
    toggle_code(state)
    value.owner.description = button_descriptions[state]

state = False
toggle_code(state)

button = widgets.ToggleButton(state, description = button_descriptions[state])
button.observe(button_action, "value")

display(button)


#------------------------------------------------------------------------------------
# Calculate cell runtime

cell_end = time.time()
seconds = cell_end - cell_start
minutes, seconds = divmod(seconds, 60)
hours, minutes = divmod(minutes, 60)
days, hours = divmod(hours, 24)

print('---------------------------------------------')
print(' Process complete! Runtime: {:0>2}d, {:0>2}h, {:0>2}m, {:05.2f}s'.
      format(int(days),int(hours),int(minutes),seconds),"\n\n")


____________________________________________________________________________
1.0 - Setup Environment: 

  ↓ Installing packages...
  ↓ Importing packages...


  ✓ SUCCESS: Packages imported! 
--------------------------------------------- 

   Current Directory:     C:\02_Data_Science\04_Projects\bah-intermediate\Capstone
   Python Executable:     C:\Anaconda3\python.exe
   Python Version:        3.7.6 (default, Jan  8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
   Python Version Info:   sys.version_info(major=3, minor=7, micro=6, releaselevel='final', serial=0) 



ToggleButton(value=False, description='Show code')

---------------------------------------------
 Process complete! Runtime: 00d, 00h, 00m, 03.80s 




-----
# 2.0 - Import Data

In [2]:
display(button)

print("____________________________________________________________________________")
print("2.0 - Import Data: \n")

"""
DESCRIPTION:
This section reads user defined input files into 'pandas' dataframes.
This process structures the data to allow for efficient editing and merging in later steps.

INSTRUCTIONS:
- Customize your input paths between quotation marks with a forward slash.
- Data inputs are read into 'pandas' dataframes and these frames can be viewed once read into memory
- Update data variables prior to running cell.

"""

cell_start = time.time()

#------------------------------------------------------------------------------------
# Step 2.1 - Set directory path

print('  ↓ Setting variables...')

# Set data directory path
data_dir = Path.cwd().parent / "Capstone" / "data" / "MIND"
os.chdir(data_dir)

# Set input file path
input1 = "C:/02_Data_Science/04_Projects/bah-intermediate/Capstone/data/MIND/news.tsv"


print('     > Data Directory:  ', data_dir)
print('     > Data Input File: ', input1)

#------------------------------------------------------------------------------------
# Step 2.2 - Generate dataframe

print('  ↓ Generating dataframe... \n')

data = pd.read_csv("news.tsv", header=None, sep='\t')

data.columns=['News ID',
             'Category',
             'SubCategory',
             'Title',
             'Abstract',
             'URL',
             'Title Entities',
             'Abstract Entities']


#------------------------------------------------------------------------------------
# Calculate cell runtime

cell_end = time.time()
seconds = cell_end - cell_start
minutes, seconds = divmod(seconds, 60)
hours, minutes = divmod(minutes, 60)
days, hours = divmod(hours, 24)

print('---------------------------------------------')
print(' Process complete! Runtime: {:0>2}d, {:0>2}m, {:05.2f}s'.
      format(int(days),int(hours),int(minutes),seconds),"\n\n")


ToggleButton(value=True, description='Hide code')

____________________________________________________________________________
2.0 - Import Data: 

  ↓ Setting variables...
     > Data Directory:   C:\02_Data_Science\04_Projects\bah-intermediate\Capstone\data\MIND
     > Data Input File:  C:/02_Data_Science/04_Projects/bah-intermediate/Capstone/data/MIND/news.tsv
  ↓ Generating dataframe... 

---------------------------------------------
 Process complete! Runtime: 00d, 00m, 00.00s 




-----
# 3.0 - Exploratory Data Analysis

### Examine dataframe: Top 5 rows

In [3]:
display(button)

# Examine dataframe: Top 5 rows
data.head(5)


ToggleButton(value=True, description='Hide code')

Unnamed: 0,News ID,Category,SubCategory,Title,Abstract,URL,Title Entities,Abstract Entities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


### Drop unwanted columns

In [4]:
display(button)

# Drop unwanted columns
data = data[["News ID","Category","SubCategory","Title","Abstract"]]
data.head()


ToggleButton(value=True, description='Hide code')

Unnamed: 0,News ID,Category,SubCategory,Title,Abstract
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the..."
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re..."


### Generate value counts for desired columns

In [5]:
display(button)

# Generate value counts for desired columns
cat = data[['Category','SubCategory']].value_counts()
print(cat)


ToggleButton(value=True, description='Hide code')

Category  SubCategory                    
news      newsus                             6564
sports    football_nfl                       5420
news      newspolitics                       2826
          newscrime                          2254
weather   weathertopstories                  2047
                                             ... 
finance   finance-home-loans                    1
          finance-homesandpropertysection       1
news      newsnational                          1
          narendramodi_opinion                  1
finance   finance-insidetheticker               1
Length: 283, dtype: int64


### Input index position to print desired title

In [6]:
# Input index position to print desired title
print(data['Title'][1])


50 Worst Habits For Belly Fat


### Generate Dataframe for Category Value Counts

In [7]:
display(button)

# Generate index for each value count
index=[]
for i in cat.index:
    index.append(np.array(i))
index=np.array(index)

# Generate custom dataframe from value counts index
df1 = pd.DataFrame(columns=['Category',
                            'Sub Category',
                            'Values'])
df1['Category']=index[:,0]
df1['Sub Category']=index[:,1]
df1['Values']=cat.values

# Examine dataframe: Top 5 rows
df1.head(5)


ToggleButton(value=True, description='Hide code')

Unnamed: 0,Category,Sub Category,Values
0,news,newsus,6564
1,sports,football_nfl,5420
2,news,newspolitics,2826
3,news,newscrime,2254
4,weather,weathertopstories,2047


# Exploratory Data Analysis

### Plotly: Bar chart
Visualize frequency of topic values by totals. Display Sub Category distribution by Category.

In [8]:
display(button)

px.bar(data_frame=df1,
       x='Category',
       y='Values',
       color='Sub Category')


ToggleButton(value=True, description='Hide code')

### Plotly: Histogram
Visualize 'data' distribution based on title length.

In [9]:
display(button)

title_list=[]
for topic in data['Title']:
    title_list.append(len(topic))
px.histogram(title_list,
             color=data['Category'])


ToggleButton(value=True, description='Hide code')

## Altair: Brush Scatter Plot - Interactive

In [10]:
display(button)

brush = alt.selection(type='interval')

sp2 = alt.Chart(df1).mark_point().encode(
    x='Values',
    y='Category:N',
    color=alt.condition(brush, 'Category', 
                        alt.value('grey')),
).add_selection(brush)

sp2


ToggleButton(value=True, description='Hide code')

## Altair: Scatter Plot - Interactive Dot Dash Plot
- Category

In [11]:
display(button)

# Configure the options common to all layers
brush = alt.selection(type='interval')
base = alt.Chart(df1).add_selection(brush)

# Configure the points
points = base.mark_point().encode(
    x=alt.X('Values', title='Count'),
    y=alt.Y('Category', title=''),
    color=alt.condition(brush, 'Category', alt.value('grey'))
)

# Configure the ticks
tick_axis = alt.Axis(labels=False, domain=False, ticks=False)

x_ticks = base.mark_tick().encode(
    alt.X('Values', axis=tick_axis),
    alt.Y('Category', title='Category Tick', axis=tick_axis),
    color=alt.condition(brush, 'Category', alt.value('lightgrey'))
)

y_ticks = base.mark_tick().encode(
    alt.X('Category', title='Category Tick', axis=tick_axis),
    alt.Y('Values', axis=tick_axis),
    color=alt.condition(brush, 'Category', alt.value('lightgrey'))
)

# Build the chart
y_ticks | (points & x_ticks)


ToggleButton(value=True, description='Hide code')

## Altair: Scatter Plot - Interactive Dot Dash Plot
- Category by Subcategory

In [12]:
display(button)

# Configure the options common to all layers
brush = alt.selection(type='interval')
base = alt.Chart(df1).add_selection(brush)

# Configure the points
points = base.mark_point().encode(
    x=alt.X('Values', title='Count'),
    y=alt.Y('Category', title=''),
    color=alt.condition(brush, 'Sub Category', alt.value('grey'))
)

# Configure the ticks
tick_axis = alt.Axis(labels=False, domain=False, ticks=False)

x_ticks = base.mark_tick().encode(
    alt.X('Values', axis=tick_axis),
    alt.Y('Category', title='Category Tick', axis=tick_axis),
    color=alt.condition(brush, 'Category', alt.value('lightgrey'))
)

y_ticks = base.mark_tick().encode(
    alt.X('Category', title='Category Tick', axis=tick_axis),
    alt.Y('Values', axis=tick_axis),
    color=alt.condition(brush, 'Category', alt.value('lightgrey'))
)

# Build the chart
y_ticks | (points & x_ticks)


ToggleButton(value=True, description='Hide code')

## Altair: Scatter Plot - Interactive Dot Dash Plot
- Subcategory by Category

In [13]:
display(button)

# Configure the options common to all layers
brush = alt.selection(type='interval')
base = alt.Chart(df1).add_selection(brush)

# Configure the points
points = base.mark_point().encode(
    x=alt.X('Values', title='Count'),
    y=alt.Y('Sub Category', title=''),
    color=alt.condition(brush, 'Category', alt.value('grey'))
)

# Configure the ticks
tick_axis = alt.Axis(labels=False, domain=False, ticks=False)

x_ticks = base.mark_tick().encode(
    alt.X('Values', axis=tick_axis),
    alt.Y('Category', title='Category Tick', axis=tick_axis),
    color=alt.condition(brush, 'Category', alt.value('lightgrey'))
)

y_ticks = base.mark_tick().encode(
    alt.X('Category', title='Category Tick', axis=tick_axis),
    alt.Y('Values', axis=tick_axis),
    color=alt.condition(brush, 'Category', alt.value('lightgrey'))
)

# Build the chart
y_ticks | (points & x_ticks)


ToggleButton(value=True, description='Hide code')

## Altair: Bar Charts
Visualize categories and subcategories horizontally

In [14]:
display(button)

click = alt.selection_multi(encodings=['color'])

bar = alt.Chart(df1).mark_bar().encode(
    x='count()',
    y='Category:N',
    color=alt.condition(click, 
                        'Sub Category', 
                        alt.value('lightgray')),
).add_selection(click)

bar


ToggleButton(value=True, description='Hide code')

## Altair: Line Chart

In [15]:
display(button)

line = alt.Chart(df1).mark_line().encode(
    x='Values',
    y='Category:N',
    color='Category',
    strokeDash='Category',
)

line


ToggleButton(value=True, description='Hide code')

## Sankey Diagram

In [16]:
display(button)

def genSankey(df1,cat_cols=[],value_cols='',title='Sankey Diagram'):
    
    # Maximum of 6 value cols -> 6 colors
    colorPalette = ['#4B8BBE','#306998','#FFE873','#FFD43B','#646464']
    
    labelList = []
    colorNumList = []
    
    for catCol in cat_cols:
        labelListTemp =  list(set(df1[catCol].values))
        colorNumList.append(len(labelListTemp))
        labelList = labelList + labelListTemp
        
    # Remove duplicates from labelList
    labelList = list(dict.fromkeys(labelList))
    
    # Define colors based on number of levels
    colorList = []
    for idx, colorNum in enumerate(colorNumList):
        colorList = colorList + [colorPalette[idx]]*colorNum
        
    # Transform df into a source-target pair
    for i in range(len(cat_cols)-1):
        if i==0:
            sourceTargetDf = df1[[cat_cols[i],cat_cols[i+1],value_cols]]
            sourceTargetDf.columns = ['source','target','count']
        else:
            tempDf = df1[[cat_cols[i],cat_cols[i+1],value_cols]]
            tempDf.columns = ['source','target','count']
            sourceTargetDf = pd.concat([sourceTargetDf,tempDf])
        sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()
        
    # Add index for source-target pair
    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))
    
    # Creating the Sankey diagram
    data = dict(
        type='sankey',
        node = dict(
          pad = 30,
          thickness = 20,
          line = dict(
            color = "black",
            width = 0.5
          ),
          label = labelList,
          color = colorList
        ),
        link = dict(
          source = sourceTargetDf['sourceID'],
          target = sourceTargetDf['targetID'],
          value = sourceTargetDf['count']
        )
      )
    
    layout =  dict(
        title = title,
        font = dict(
          size = 10
        )
    )
       
    fig = dict(data=[data], layout=layout)
    return fig

fig = genSankey(df1,
                cat_cols=['Sub Category','Category'],
                value_cols='Values',
                title='MIND Dataset > Sankey Diagram')

plotly.offline.plot(fig, validate=False)


ToggleButton(value=True, description='Hide code')

'temp-plot.html'

-----
# 4.0 - Data Preprocessing

### Print length of dataframe before and after processing

In [17]:
display(button)

# Print length of 'data' dataframe before dropping duplicates
print('Number of articles before processing:',len(data))

# Remove duplicate values
data.drop_duplicates(subset=['Title'],inplace=True)

# Print length of 'data' dataframe after dropping duplicates
print('Number of articles after processing: ',len(data))

print('\n')

# Print sum of null cell values in column
print('Sum of null cell values: ')
data.isna().sum()


ToggleButton(value=True, description='Hide code')

Number of articles before processing: 51282
Number of articles after processing:  50434


Sum of null cell values: 


News ID           0
Category          0
SubCategory       0
Title             0
Abstract       2646
dtype: int64

### Process Data and Select Titles with more than 4 Words

In [18]:
display(button)

#-------------------------------------------------------------
print('Process Data and Select Titles: \n')
      
      
# Drop missing rows from 'data' dataframe
data.dropna(inplace=True)

# Print length of 'data' dataframe before processing
print('  - Number of articles before processing:',len(data))

# Get title with more than 4 words
print('  - Select titles with > 4 words...')
data = data[data['Title'].apply((lambda x: len(x.split())>=4))]

# Print length of 'data' dataframe after processing
print('  - Number of articles after processing: ',len(data))

# Create copy of data for further processing
df2 = data.copy(deep=True)


#-------------------------------------------------------------

print('---------------------------------------------')
print('  ✓ SUCCESS: Processes complete! \n \n')


ToggleButton(value=True, description='Hide code')

Process Data and Select Titles: 

  - Number of articles before processing: 47788
  - Select titles with > 4 words...
  - Number of articles after processing:  47661


---------------------------------------------
  ✓ SUCCESS: Processes complete! 
 



### Examine dataframe: Information

In [19]:
display(button)

# Examine dataframe: Information
df2.info()


ToggleButton(value=True, description='Hide code')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47661 entries, 0 to 51280
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   News ID      47661 non-null  object
 1   Category     47661 non-null  object
 2   SubCategory  47661 non-null  object
 3   Title        47661 non-null  object
 4   Abstract     47661 non-null  object
dtypes: object(5)
memory usage: 2.2+ MB


### Examine dataframe: Top 5 rows

In [20]:
display(button)

# Examine dataframe: Top 5 rows
df2.head()


ToggleButton(value=True, description='Hide code')

Unnamed: 0,News ID,Category,SubCategory,Title,Abstract
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the..."
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re..."


-----
# 5.0 - Text Preprocessing

In [21]:
display(button)

#-------------------------------------------------------------
print('Process Data and Select Titles: \n')
      
# Function to remove stopwords from a particular column and then tokenize it
print('  - Define function to remove stopwords and tokenize column...')

def rem_stopwords_tokenize(data,name):
      
    def getting(sen):
        example_sent = sen
        stop_words = set(stopwords.words('english')) 
        word_tokens = word_tokenize(example_sent) 
        filtered_sentence = [w for w in word_tokens if not w in stop_words] 
        filtered_sentence = [] 

        for w in word_tokens: 
            if w not in stop_words: 
                filtered_sentence.append(w) 
        return filtered_sentence
    x=[]
    for i in data[name].values:
        x.append(getting(i))
    data[name]=x

#-------------------------------------------------------------
print('  - Define and instantiate lemmatizer...')

lemmatizer = WordNetLemmatizer() 

# Define function to lemmatize all the words
def lemmatize_all(data,name):
    arr = data[name]
    a = []
    for i in arr:
        b=[]
        for j in i:
            x=lemmatizer.lemmatize(j,pos='a')
            x=lemmatizer.lemmatize(x)
            b.append(x)
        a.append(b)
    data[name]=a

#-------------------------------------------------------------
print('  - Process and lemmatize data...')

# Removing Stop words from 'Title' Column
rem_stopwords_tokenize(data,'Title')

# Lemmatize the 'Title' column
lemmatize_all(data,'Title')

# Make a copy of dataframe to use in the future
data4 = data.copy(deep=True)


#-------------------------------------------------------------
print('  - Convert data to string... \n')

def convert_to_string(data,name):
    t = data[name].values
    p = []
    for i in t:
        listToStr = ' '.join(map(str, i))
        p.append(listToStr)
    data[name]=p

# Convert 'data' dataframe back to string
convert_to_string(data,'Title')


#-------------------------------------------------------------

print('---------------------------------------------')
print('  ✓ SUCCESS: Processes complete! \n \n')


ToggleButton(value=True, description='Hide code')

Process Data and Select Titles: 

  - Define function to remove stopwords and tokenize column...
  - Define and instantiate lemmatizer...
  - Process and lemmatize data...
  - Convert data to string... 

---------------------------------------------
  ✓ SUCCESS: Processes complete! 
 



# 7.0 - Article Recommendations

### Select Desired Article Title

In [22]:
display(button)

# Examine dataframe: Top 10 rows
data.head(10)


ToggleButton(value=True, description='Hide code')

Unnamed: 0,News ID,Category,SubCategory,Title,Abstract
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth , Prince Charles , ...","Shop the notebooks, jackets, and more that the..."
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...
2,N61837,news,newsworld,The Cost Trump 's Aid Freeze Trenches Ukraine ...,Lt. Ivan Molchanets peeked over a parapet of s...
3,N53526,health,voices,I Was An NBA Wife . Here 's How It Affected My...,"I felt like I was a fraud, and being an NBA wi..."
4,N38324,health,medical,"How Get Rid Skin Tags , According Dermatologist","They seem harmless, but there's a very good re..."
5,N2073,sports,football_nfl,Should NFL able fine player criticizing offici...,Several fines came down against NFL players fo...
6,N49186,weather,weathertopstories,"It 's Orlando 's hot October ever far , cool t...",There won't be a chill down to your bones this...
7,N59295,news,newsworld,Chile : Three die supermarket fire amid protest,Three people have died in a supermarket fire a...
8,N24510,entertainment,gaming,Best PS5 game : top PlayStation 5 title look f...,Every confirmed or expected PS5 game we can't ...
9,N39237,news,newsscienceandtechnology,"How report weather-related closing , delay","When there are active closings, view them here..."


### Select title by index and copy for use in Article Recommender:

In [45]:
# Input index position to print desired title
print(data['Title'][1])


50 Worst Habits For Belly Fat


## Article Recommender: Bag-of-Words Model - CountVectorizer
CountVectorizer provides a way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

In [46]:
display(button)

# Define Euclidean_Distance_based_model
def Euclidean_Distance_Based_Model(row_index, num_similar_items):
    cat = data['Category'][row_index]
    title = data['Title'][row_index]
    cat_data=data[data['Category']==cat]
 
    row_index2 = cat_data[cat_data['Title']==title].index
    headline_features = headline_vectorizer.fit_transform(cat_data['Title'].values)
    couple_dist = pairwise_distances(headline_features,headline_features[row_index2])
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    
    df = pd.DataFrame({'Headline':df2[df2['Category']==cat]['Title'].values[indices],
                       'Category':cat_data['Category'].values[indices],
                       'Abstract':cat_data['Abstract'].values[indices],
                       'Euclidean Distance Similarity': couple_dist[indices].ravel()})

    print('='*80, '\n', 'Headline Input: ',data['Title'][indices[0]], '\n')
    print('='*80, '\n', 'CountVectorizer Article Recommendations:')
    
    
    return df.iloc[1:,:]

# Apply CountVectorizer
headline_vectorizer = CountVectorizer()

# Set news article input
title = input('Input News Headline to generate list of similar articles:')
clear_output()

# Apply model
ind = df2[df2['Title']==title].index[0]
df_output = Euclidean_Distance_Based_Model(ind, 100)

# Print top 10 articles matching model
df_output.head(10)



 Headline Input:  50 Worst Habits For Belly Fat 

 CountVectorizer Article Recommendations:


Unnamed: 0,Headline,Category,Abstract,Euclidean Distance Similarity
1,How to deal with your health worries,health,One of the most stressful things in our modern...,3.316625
2,Health Benefits of Nutritional Yeast and How t...,health,"Vegans love the cheesy, umami flavor and there...",3.464102
3,Health Watch - Low Libido,health,Health Watch - Low Libido.,3.605551
4,How Easy Is It to Get Pneumonia?,health,Celebs such as Oprah Winfrey and Whoopi Goldbe...,3.605551
5,When I Had to Choose Between My Health and How...,health,I never thought my chronic illness would force...,3.605551
6,Health problems treated by acupuncture,health,The health benefits of acupuncture are backed ...,3.605551
7,I Have Anxiety and Here's How It Actually Help...,health,I was diagnosed with panic disorder 20 years a...,3.605551
8,Arthritis: Watch out for these symptoms,health,Characterized by inflammation in the joints or...,3.741657
9,How to know when it's time for new joints,health,Doctors say because the artificial joints comm...,3.741657
10,How nature saves us trillions of dollars in he...,health,Researchers at Griffith University in Australi...,3.741657


## Article Recommender: Bag-of-Words Model - TfidfVectorizer 

TfidfVectorizer tokenizes documents, learns the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternately, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.

In [47]:
display(button)

def TFIDF_Based_Model(row_index, num_similar_items):
    cat = data['Category'][row_index]
    title = data['Title'][row_index]
    cat_data = data[data['Category']==cat]
 
    row_index2 = cat_data[cat_data['Title']==title].index
    headline_features = tfidf_headline_vectorizer.fit_transform(cat_data['Title'].values)
    couple_dist = pairwise_distances(headline_features,headline_features[row_index2])
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    
    df = pd.DataFrame({'Headline':df2[df2['Category']==cat]['Title'].values[indices],
                       'Category':cat_data['Category'].values[indices],
                       'Abstract':cat_data['Abstract'].values[indices],
                       'Euclidean Distance Similarity': couple_dist[indices].ravel()})
    
    print('='*80, '\n', 'Headline Input: ',data['Title'][indices[0]], '\n')
    print('='*80, '\n', 'TfidfVectorizer Article Recommendations:')
    
    return df.iloc[1:,:]

# Apply TfidfVectorizer
tfidf_headline_vectorizer = TfidfVectorizer(min_df=0)

# Set news article input
title=input('Input News Headline to generate list of similar articles:')
clear_output()

# Apply model
ind = df2[df2['Title']==title].index[0]
df_output = TFIDF_Based_Model(ind, 100)

# Print top 10 articles matching model
df_output.head(10)



 Headline Input:  50 Worst Habits For Belly Fat 

 TfidfVectorizer Article Recommendations:


Unnamed: 0,Headline,Category,Abstract,Euclidean Distance Similarity
1,"I Had 1 Cheat Day Once a Week For 2 Months, an...",health,Some experts and dieters agree that cheat meal...,1.204529
2,'How I Told My Kids About My Breast Cancer',health,I was scared that they would be scared. I was ...,1.253885
3,Pippa Middleton Takes Baby Son to Cranial Oste...,health,"Soon after Arthur was born last year, I heard ...",1.265457
4,How Listening to My Gut and a New Symptom Save...,health,A woman with chronic illness describes what ha...,1.274405
5,I Was 24 & Had Just Scored My Dream Job In Fas...,health,One in four adults in the U.S. are living with...,1.278005
6,How to Ensure Your Mental Health Remains a Pri...,health,"It's commonly referred to as ""the most wonderf...",1.280123
7,When I Had to Choose Between My Health and How...,health,I never thought my chronic illness would force...,1.283757
8,How To Reach Out If Your Friend Is Struggling ...,health,Reaching out to check in on someone who strugg...,1.284652
9,"My Doctors Told Me I Had IBS. 4 Years Later, I...",health,"One in three women has this disorder, and it o...",1.285335
10,I Have Anxiety and Here's How It Actually Help...,health,I was diagnosed with panic disorder 20 years a...,1.288284


-----
# 8.0 - Category Recommendations

### Examine dataframe: Top 5 rows

In [26]:
display(button)

# Examine dataframe: Top 5 rows
data.head()


ToggleButton(value=True, description='Hide code')

Unnamed: 0,News ID,Category,SubCategory,Title,Abstract
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth , Prince Charles , ...","Shop the notebooks, jackets, and more that the..."
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...
2,N61837,news,newsworld,The Cost Trump 's Aid Freeze Trenches Ukraine ...,Lt. Ivan Molchanets peeked over a parapet of s...
3,N53526,health,voices,I Was An NBA Wife . Here 's How It Affected My...,"I felt like I was a fraud, and being an NBA wi..."
4,N38324,health,medical,"How Get Rid Skin Tags , According Dermatologist","They seem harmless, but there's a very good re..."


### Test/train/split data

In [29]:
display(button)

# Set X, y values for test/train/split
X = data['Title'].values
y = data['Category'].values

# Train the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=55)

print("Test/train/split complete!")


ToggleButton(value=True, description='Hide code')

Test/train/split complete!


## TruncatedSVD and DecisionTreeClassifier

In [42]:
display(button)

# Define method to print report
def print_report(pipe1):
    y_pred = pipe1.predict(X_test)
    p=np.unique(y_test)
    report = metrics.classification_report(
        y_test, 
        y_pred,
        target_names=p)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))
    

# Generate TruncatedSVD and Decision Tree Classifier pipeline
vec = TfidfVectorizer(min_df=4)
svd = TruncatedSVD(n_components=400, 
                   n_iter=8, 
                   random_state=42)
lsa = make_pipeline(vec, svd)
dtc = DecisionTreeClassifier()
pipe1 = make_pipeline(lsa, dtc)


# Fit the data
pipe1.fit(X_train, y_train)

# Calculate values
val = pipe1.score(X_test, y_test)
per = "{:.2%}".format(val)

# Calculate report
print("-"*75,"\n SVD and Decision Tree Classifier \n"+"-"*75+"\n")
print_report(pipe1)
print("")
print("accuracy full:", val)
print("accuracy pct :", per)


ToggleButton(value=True, description='Hide code')

--------------------------------------------------------------------------- 
 SVD and Decision Tree Classifier 
---------------------------------------------------------------------------

               precision    recall  f1-score   support

        autos       0.15      0.14      0.15       769
entertainment       0.10      0.08      0.09       286
      finance       0.16      0.15      0.15      1516
 foodanddrink       0.25      0.25      0.25      1230
       health       0.16      0.15      0.16       913
         kids       0.00      0.00      0.00         8
    lifestyle       0.14      0.14      0.14      1160
   middleeast       0.00      0.00      0.00         1
       movies       0.07      0.07      0.07       299
        music       0.05      0.06      0.06       334
         news       0.54      0.54      0.54      7476
       sports       0.68      0.68      0.68      6357
       travel       0.09      0.09      0.09      1120
           tv       0.04      0.05      

## TruncatedSVD and XBGClassifier

In [31]:
display(button)

# Define method to print report
def print_report(pipe2):
    y_pred = pipe2.predict(X_test)
    p=np.unique(y_test)
    report = metrics.classification_report(
        y_test, 
        y_pred,
        target_names=p)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))


# Generate TruncatedSVD and XBGClassifier pipeline
vec = TfidfVectorizer(min_df=0)
svd = TruncatedSVD(n_components=10, 
                   n_iter=1, 
                   random_state=42)
lsa = make_pipeline(vec, svd)
xgb = XGBClassifier(verbosity=0)
pipe2 = make_pipeline(lsa, xgb)

# Fit the data
pipe2.fit(X_train, y_train)

# Calculate values
val = pipe2.score(X_test, y_test)
per = "{:.2%}".format(val)

# Generate report
print("-"*75,"\n TruncatedSVD and XGBClassifier \n"+"-"*75+"\n")
print_report(pipe2)
print("")
print("accuracy full:", val)
print("accuracy pct :", per)


ToggleButton(value=True, description='Hide code')





--------------------------------------------------------------------------- 
 TruncatedSVD and XGBClassifier 
---------------------------------------------------------------------------

               precision    recall  f1-score   support

        autos       0.27      0.07      0.11       769
entertainment       0.28      0.04      0.07       286
      finance       0.21      0.06      0.09      1516
 foodanddrink       0.30      0.30      0.30      1230
       health       0.24      0.11      0.15       913
         kids       0.00      0.00      0.00         8
    lifestyle       0.21      0.14      0.17      1160
   middleeast       0.00      0.00      0.00         1
       movies       0.12      0.01      0.02       299
        music       0.13      0.02      0.04       334
         news       0.52      0.80      0.63      7476
       sports       0.62      0.79      0.70      6357
       travel       0.16      0.03      0.05      1120
           tv       0.11      0.02      0.

## TruncatedSVD and LGBMClassifier

In [32]:
display(button)

# Define method to print report
def print_report(pipe3):
    y_pred = pipe3.predict(X_test)
    p=np.unique(y_test)
    report = metrics.classification_report(
        y_test, 
        y_pred,
    )
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))


# Generate TruncatedSVD and LGBMClassifier pipeline
vec = TfidfVectorizer(min_df=0)
svd = TruncatedSVD(n_components=100, 
                   n_iter=1, 
                   random_state=42)
lsa = make_pipeline(vec, svd)
lgm = LGBMClassifier()
pipe3 = make_pipeline(lsa, lgm)


# Fit the data
pipe3.fit(X_train, y_train)

# Calculate values
val = pipe3.score(X_test, y_test)
per = "{:.2%}".format(val)

# Generate report for pipeline using LGBMClassifier
print("-"*75,"\n TruncatedSVD and LGBMClassifier \n"+"-"*75+"\n")
print_report(pipe3)
print("")
print("accuracy full:", val)
print("accuracy pct :", per)


ToggleButton(value=True, description='Hide code')

--------------------------------------------------------------------------- 
 TruncatedSVD and LGBMClassifier 
---------------------------------------------------------------------------

               precision    recall  f1-score   support

        autos       0.22      0.15      0.18       769
entertainment       0.06      0.07      0.07       286
      finance       0.21      0.14      0.17      1516
 foodanddrink       0.32      0.27      0.29      1230
       health       0.23      0.18      0.20       913
         kids       0.00      0.00      0.00         8
    lifestyle       0.19      0.16      0.17      1160
   middleeast       0.00      0.00      0.00         1
       movies       0.05      0.08      0.06       299
        music       0.09      0.11      0.10       334
         news       0.51      0.63      0.56      7476
 northamerica       0.00      0.00      0.00         0
       sports       0.66      0.67      0.67      6357
       travel       0.12      0.07      0

## Generate Category Predictions

In [33]:
# Print value of title 
print(data['Title'][1])


50 Worst Habits For Belly Fat


In [34]:
display(button)

p=np.unique(y_test)

def print_prediction(doc):
    y_pred = pipe3.predict_proba([doc])[0]
    for target, prob in zip(p, y_pred):
        print("{:.3f} {}".format(prob, target))

# Set news article input
doc=input('Input news article title to generate predictions of Category:')
clear_output()


print("-"*75,"\n Category Predictions by Percentage: \n"+"-"*75+"\n")
print_prediction(doc)


--------------------------------------------------------------------------- 
 Category Predictions by Percentage: 
---------------------------------------------------------------------------

0.014 autos
0.003 entertainment
0.073 finance
0.357 foodanddrink
0.232 health
0.000 kids
0.104 lifestyle
0.000 middleeast
0.000 movies
0.006 music
0.092 news
0.000 sports
0.050 travel
0.048 tv
0.002 video
0.012 weather


# Text Explainer
Per the eli5 documentation, some pipelines are not supported by eli5 directly, but one can use eli5.lime.TextExplainer to debug the prediction in order to check what was important in the document and used to make this decision. This outputs a comprehensive report that lists information and scores related to the top processed features.

In [35]:
display(button)

# Apply TextExplainer to doc and pipe3

te = TextExplainer(random_state=50)
te.fit(doc, pipe3.predict_proba)
te.show_prediction()


ToggleButton(value=True, description='Hide code')

Contribution?,Feature
-0.849,<BIAS>
-3.782,Highlighted in text (sum)

Contribution?,Feature
-0.655,<BIAS>
-8.826,Highlighted in text (sum)

Contribution?,Feature
-0.953,<BIAS>
-1.65,Highlighted in text (sum)

Contribution?,Feature
-0.004,Highlighted in text (sum)
-1.096,<BIAS>

Contribution?,Feature
0.018,Highlighted in text (sum)
-0.963,<BIAS>

Contribution?,Feature
-0.738,<BIAS>
-1.32,Highlighted in text (sum)

Contribution?,Feature
-0.911,<BIAS>
-6.448,Highlighted in text (sum)

Contribution?,Feature
-0.716,<BIAS>
-7.432,Highlighted in text (sum)

Contribution?,Feature
-0.501,<BIAS>
-2.348,Highlighted in text (sum)

Contribution?,Feature
-0.947,<BIAS>
-1.883,Highlighted in text (sum)

Contribution?,Feature
-1.0,<BIAS>
-1.747,Highlighted in text (sum)

Contribution?,Feature
-0.635,<BIAS>
-7.918,Highlighted in text (sum)

Contribution?,Feature
-0.852,<BIAS>
-4.955,Highlighted in text (sum)

Contribution?,Feature
-0.306,<BIAS>
-4.966,Highlighted in text (sum)


# TextExplainer with Character Analyzer

In [36]:
display(button)

# Define method to print report
def print_report(pipe4):
    y_pred = pipe4.predict(X_test)
    p=np.unique(y_test)
    report = metrics.classification_report(
        y_test, 
        y_pred)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))
    
# Generate text explainer pipeline
vec = TfidfVectorizer(min_df=0, 
                      analyzer='char',
                      ngram_range=(3,6))
svd = TruncatedSVD(n_components=100, 
                   n_iter=1, 
                   random_state=42)
lsa = make_pipeline(vec, svd)
lgm = LGBMClassifier()
pipe4 = make_pipeline(lsa, lgm)


# Fit the data
pipe4.fit(X_train, y_train)

# Calculate values
val = pipe4.score(X_test, y_test)
per = "{:.2%}".format(val)

# Generate report for pipeline using LGBMClassifier
print("-"*75,"\n Text Explainer with Character Analyzer \n"+"-"*75+"\n")
print_report(pipe4)
print("")
print("accuracy full:", val)
print("accuracy pct :", per)


ToggleButton(value=True, description='Hide code')

--------------------------------------------------------------------------- 
 Text Explainer with Character Analyzer 
---------------------------------------------------------------------------

               precision    recall  f1-score   support

        autos       0.15      0.15      0.15       769
entertainment       0.06      0.09      0.07       286
      finance       0.22      0.15      0.18      1516
 foodanddrink       0.31      0.30      0.30      1230
       health       0.21      0.19      0.20       913
         kids       0.00      0.00      0.00         8
    lifestyle       0.16      0.15      0.16      1160
   middleeast       0.00      0.00      0.00         1
       movies       0.06      0.11      0.07       299
        music       0.05      0.07      0.06       334
         news       0.50      0.50      0.50      7476
 northamerica       0.00      0.00      0.00         0
       sports       0.58      0.58      0.58      6357
       travel       0.12      0.10

In [37]:
display(button)

# Apply MaskingTextSample to doc
sampler = MaskingTextSampler(
    token_pattern='.',
    max_replace=3,        # by default all tokens are replaced; replace only a token at a given position.
    bow=False,
)
samples, similarity = sampler.sample_near(doc)
print(samples[0])



ToggleButton(value=True, description='Hide code')

50 Worst Habis For Belly Fat


In [38]:
display(button)

# Apply TextExplainer to Output

te = TextExplainer(char_based=True, 
                   sampler=sampler, 
                   random_state=42)
te.fit(doc, pipe4.predict_proba)
te.show_prediction()


ToggleButton(value=True, description='Hide code')

Contribution?,Feature
-0.088,<BIAS>
-3.127,Highlighted in text (sum)

Contribution?,Feature
-0.046,<BIAS>
-4.834,Highlighted in text (sum)

Contribution?,Feature
-0.055,<BIAS>
-3.447,Highlighted in text (sum)

Contribution?,Feature
-0.075,<BIAS>
-2.257,Highlighted in text (sum)

Contribution?,Feature
-0.13,<BIAS>
-0.777,Highlighted in text (sum)

Contribution?,Feature
-0.103,<BIAS>
-2.146,Highlighted in text (sum)

Contribution?,Feature
-0.045,<BIAS>
-6.561,Highlighted in text (sum)

Contribution?,Feature
-0.094,<BIAS>
-4.169,Highlighted in text (sum)

Contribution?,Feature
-0.068,<BIAS>
-2.252,Highlighted in text (sum)

Contribution?,Feature
-0.074,<BIAS>
-1.265,Highlighted in text (sum)

Contribution?,Feature
0.027,<BIAS>
-4.685,Highlighted in text (sum)

Contribution?,Feature
-0.062,<BIAS>
-5.13,Highlighted in text (sum)

Contribution?,Feature
-0.097,<BIAS>
-3.257,Highlighted in text (sum)

Contribution?,Feature
-0.096,<BIAS>
-4.571,Highlighted in text (sum)
