# NICAR Workshop: Machine Learning and NLP

By Jeff Kao, ProPublica

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

## Introduction

Wow, that was a lot of dataviz and ML! We're finally on our way to using that for NLP.

I'll be posting these notebooks on [ProPublica's GitHub](https://github.com/propublica) for your future use and play.

### What this is.

How can we use ML/NLP to help us break down and investigate larger datasets? How can we gain an intuitive sense of 'how it works' without having to dig deeply into the math?

Instead of learning it from the ground (math) up, let's get a top-down understanding with the help of data visualization.

Goals for this session:
* gain an intuitive understanding of unsupervised machine learning
* connect machine learning to NLP
* learn to incorporate these techniques into your investigations

### What this is not.

Machine learning is not magic. These tools are useful additions to a data journalist's repertoire, but they don't replace what we are already good at: understanding the context and real-world interactions underlying the data. They also don't replace the 'traditional' statistical techniques we already have.

This session will NOT be:
* overly math-y (although a deeper understanding of the math helps you get better results)
* about algorithmic bias (although one should always be aware of the gaps between the real world and its representation in data)
* about supervised machine learning (there are already a ton of online resources dedicated to that)
* about when machine learning is useful (although that is great to learn too)


In [None]:
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 500

In [None]:
# !unzip ./data/ira_tweets_csv_hashed.zip -d ./data/

In [None]:
# !unzip ./data/ira_users_csv_hashed.zip -d ./data/

In [None]:
df_users = pd.read_csv('./data/ira_users_csv_hashed.csv')

In [None]:
df_users

In [None]:
df_users_eng = df_users[df_users['account_language'] == 'en']

In [None]:
%%time
df_all = pd.read_csv('./data/ira_tweets_csv_hashed.csv')

In [None]:
df_all.head()

In [None]:
df_all.columns

In [None]:
df_all['tweet_language'].value_counts()

In [None]:
%%time
df_eng = df_all[df_all['tweet_language'] == 'en']
df_eng = df_eng[df_eng['userid'].isin(df_users_eng['userid'])]

In [None]:
df_eng_red_cols = df_eng[['tweetid', 'userid',
       'tweet_text', 'tweet_time', 'tweet_client_name', 
       'in_reply_to_tweetid', 'in_reply_to_userid', 'quoted_tweet_tweetid', 
       'quote_count', 'reply_count', 'like_count', 'retweet_count',
       'is_retweet', 'retweet_userid', 'retweet_tweetid']]

In [None]:
# Combine each account's tweets into one long "document" per userid
df_eng_by_user = \
(df_eng_red_cols[['userid','tweet_text']]
 .groupby('userid')
 .agg({'userid': 'first', 'tweet_text': lambda x: ' '.join(x)})
 .set_index('userid'))

In [None]:
df_eng_by_user

In [None]:
df = df_eng_by_user

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
count_vect = CountVectorizer(stop_words='english')

In [None]:
%%time
X_counts = count_vect.fit_transform(df['tweet_text'])

In [None]:
# Build a table of tokens ordered by column index, then attach each token's total count
df_counts = (pd.DataFrame
             .from_dict(count_vect.vocabulary_, orient='index')
             .rename(columns={0: 'index'})
             .sort_values(by='index'))
df_counts['count'] = np.array(X_counts.sum(axis=0)).flatten()
df_counts = df_counts.drop(columns=['index'])
fig, ax = plt.subplots(figsize=(15,15))
ax.set_title('Top 80 tokens')
df_counts.sort_values('count', ascending=False)[:80].sort_values('count').plot.barh(ax=ax);

In [None]:
tfidf_vect = TfidfVectorizer(stop_words='english')

In [None]:
%%time
X_tfidfs = tfidf_vect.fit_transform(df['tweet_text'])

In [None]:
X_tfidfs.shape

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

In [None]:
# LSA: reduce the tf-idf vectors to 300 dimensions, then L2-normalize each row
svd = TruncatedSVD(n_components=300, random_state=42)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

In [None]:
%%time
X_lsa = lsa.fit_transform(X_tfidfs)

In [None]:
X_lsa.shape

In [None]:
explained_variance = svd.explained_variance_ratio_.sum()
print(f"Explained variance of the SVD step: {int(explained_variance * 100)}%")

In [None]:
import umap

In [None]:
%%time

# Input features: the LSA-reduced, normalized tf-idf matrix
x = X_lsa

reducer = umap.UMAP()
um = reducer.fit_transform(x)
df_um = pd.DataFrame(
    data = um,
    columns = ['um1', 'um2']
)

In [None]:
df_um.index = df_eng_by_user.index

In [None]:
df_um

In [None]:
%matplotlib inline
fig, ax = plt.subplots(
    nrows=1,
    ncols=1,
    figsize=(15,15)
)
sns.scatterplot(data=df_um, y='um2', x='um1', alpha=0.1, ax=ax);

In [None]:
%%time

# Input features: the raw (sparse) tf-idf matrix, for comparison with the LSA version
x = X_tfidfs

reducer = umap.UMAP()
um = reducer.fit_transform(x)
df_um = pd.DataFrame(
    data = um,
    columns = ['um1', 'um2']
)

In [None]:
%matplotlib inline
fig, ax = plt.subplots(
    nrows=1,
    ncols=1,
    figsize=(15,15)
)
sns.scatterplot(data=df_um, y='um2', x='um1', alpha=0.1, ax=ax);

In [None]:
# Now each document is a point in space, so we can visualize it or run clustering algorithms on it!

## Machine Learning: NLP and Unsupervised learning

Supervised learning requires labelled data, but at the start of an investigation we often don't yet know what we are looking for. Unsupervised learning helps here: it surfaces structure in the data, such as groups of similar records or unusual outliers, without needing labels. Relevant algorithms include:
* K-means
* HDBSCAN
* IsolationForest
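
As a minimal sketch of what clustering does, the example below runs scikit-learn's `KMeans` on six made-up 2-D points. The points and the choice of two clusters are assumptions for illustration only and are not taken from the Twitter data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points standing in for documents after dimensionality reduction.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one tight group
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.7],   # another tight group
])

# n_clusters is something we have to choose ourselves;
# n_init and random_state make the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(points)

print(labels)                    # one cluster label per point
print(kmeans.cluster_centers_)   # the centre of each cluster
```

K-means needs the number of clusters up front, which is often a judgment call in an investigation. Density-based methods such as HDBSCAN instead infer the number of clusters from the data.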

## Twitter Dataset

## Data Exploration

## Natural Language Processing: Turning words to numbers

* bag-of-words (discuss bag-of-characters and n-grams)
* tf-idf
* LSI (latent semantic indexing, also known as latent semantic analysis or LSA)
* we won't have time for:
  * word2vec and other deep-learning based language models
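
To make "turning words to numbers" concrete, here is a small pure-Python sketch of bag-of-words counts and tf-idf weights on three made-up sentences. The sentences, the whitespace tokenizer, and the helper name `tfidf_vector` are assumptions for illustration. The smoothed idf formula and L2 normalization are meant to mirror what scikit-learn's `TfidfVectorizer` does by default, as far as I understand it, but this is not its implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each string is one "document".
docs = [
    "the senate vote is today",
    "the vote count is in",
    "the weather is sunny today",
]

# Bag-of-words: count how often each token appears in each document.
tokenized = [doc.split() for doc in docs]
counts = [Counter(tokens) for tokens in tokenized]
vocab = sorted(set(word for tokens in tokenized for word in tokens))

# Document frequency: in how many documents does each word appear?
n_docs = len(docs)
doc_freq = {word: sum(word in c for c in counts) for word in vocab}

# Smoothed inverse document frequency: rarer words get larger weights.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}


def tfidf_vector(count):
    """Return an L2-normalized tf-idf vector ordered like `vocab`."""
    raw = [count[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


vectors = [tfidf_vector(c) for c in counts]

for word in ("the", "vote", "senate"):
    print(f"{word!r}: in {doc_freq[word]} of {n_docs} docs, idf = {idf[word]:.2f}")
```

Words that appear in every document (like "the") get the smallest weight, while words unique to one document get the largest. That is why tf-idf tends to highlight what is distinctive about each document.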

## Natural Language Processing: Clustering and outlier detection
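
Outlier detection can be sketched with scikit-learn's `IsolationForest`. The example below is only an illustration under stated assumptions: the grid of "ordinary" points and the single far-away point are made up, and in practice the input would be document vectors rather than 2-D toy data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: a 5x5 grid of "ordinary" points plus one far-away point.
grid = np.array([[x, y] for x in np.linspace(0, 1, 5) for y in np.linspace(0, 1, 5)])
points = np.vstack([grid, [[10.0, 10.0]]])

forest = IsolationForest(n_estimators=100, random_state=42).fit(points)

# score_samples: the lower the score, the more anomalous the point.
scores = forest.score_samples(points)
most_unusual = int(np.argmin(scores))

print(most_unusual, points[most_unusual])
```

Ranking by score rather than relying on a fixed cutoff gives a shortlist of unusual records to check by hand, which is usually what an investigation needs.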

## Sample Analysis