# NLP Analysis Using Scattertext and SpaCy

This notebook follows this Medium blog post: https://towardsdatascience.com/analyzing-yelp-dataset-with-scattertext-spacy-82ea8bb7a60e

- SpaCy: https://spacy.io/usage
- Scattertext: https://github.com/JasonKessler/scattertext#installation

Exploring the Disaster Tweets data from this Kaggle competition: https://www.kaggle.com/c/nlp-getting-started/overview

In [32]:
# Imports
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import spacy
import scattertext

from nltk.corpus import stopwords
import string

In [2]:
# If you're just now using spacy for the first time,
# you'll need to download the English model:
# !python -m spacy download en_core_web_sm

In [3]:
# Reading in training data for the Disaster Tweets competition
df = pd.read_csv('data/train.csv')

In [4]:
df.head()

|   | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [7]:
# Map the numeric target to readable labels; Scattertext works with
# string category names, so the numeric codes aren't needed
target_map = {1: "IsDisaster", 0: "NotDisaster"}
df["target"] = df["target"].map(target_map)

In [8]:
df['target'].value_counts()

NotDisaster    4342
IsDisaster     3271
Name: target, dtype: int64

In [6]:
nlp = spacy.load('en_core_web_sm')

## Aggregating the Stopwords List

In [40]:
# Default stopwords list from spacy.
# Copy it so the in-place |= updates below don't modify
# nlp.Defaults.stop_words itself
disaster_stopwords = set(nlp.Defaults.stop_words)

In [41]:
# Adding stopwords from nltk's list
# (first use requires: import nltk; nltk.download('stopwords'))
nltk_stopwords = set(stopwords.words('english'))
disaster_stopwords |= nltk_stopwords

In [42]:
# Adding string punctuation
disaster_stopwords |= set(string.punctuation)

In [63]:
# Last but not least, an additional text file of stopwords, one per line
with open('stopwords.txt', 'r') as f:
    # Strip whitespace and skip blank lines so '' doesn't end up in the set
    set_stopwords = {line.strip() for line in f if line.strip()}
disaster_stopwords |= set_stopwords
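
Because `nlp.Defaults.stop_words` is a Python set and `|=` updates a set in place, it matters whether the combined list is built on the original object or on a copy. The sketch below illustrates the difference with small stand-in sets; the words are placeholders, not spaCy's or nltk's real lists.

```python
# Stand-in for a library's default stopword set (placeholder words only)
library_defaults = {"the", "and", "of"}
extra_words = {"rt", "amp"}

# Aliasing: both names refer to the same set, so |= also changes the original
aliased = library_defaults
aliased |= extra_words
after_alias = sorted(library_defaults)

# Copying first: the original set is left untouched
library_defaults = {"the", "and", "of"}
copied = set(library_defaults)
copied |= extra_words
after_copy = sorted(library_defaults)

print(after_alias)
print(after_copy)
```

If the spaCy defaults should stay as shipped (for example, for `token.is_stop` checks later on), building the combined list on a copy is the safer choice.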

In [49]:
def term_freq(df):
    '''
    Function from Gyhou: https://github.com/gyhou/yelp_dataset

    Inputs:

    df - pandas dataframe with a category column ('target') and
            a column of texts to explore ('text')

    Outputs:

    corpus - scattertext corpus with the stopwords removed
    df_is - term frequency dataframe sorted by IsDisaster frequency
    df_isnot - term frequency dataframe sorted by NotDisaster frequency
    '''
    corpus = (scattertext.CorpusFromPandas(df,
                                           category_col='target',
                                           text_col='text',
                                           nlp=nlp)
              .build()
              .remove_terms(disaster_stopwords, ignore_absences=True)
              )
    # Per-category term counts, plus each category's scaled F-score
    # rounded to 2 decimals
    df = corpus.get_term_freq_df()
    df['IsDisaster_Score'] = corpus.get_scaled_f_scores("IsDisaster")
    df['NotDisaster_Score'] = corpus.get_scaled_f_scores("NotDisaster")
    df['IsDisaster_Score'] = df['IsDisaster_Score'].round(2)
    df['NotDisaster_Score'] = df['NotDisaster_Score'].round(2)

    df_is = df.sort_values(by='IsDisaster freq',
                           ascending=False).reset_index()
    df_isnot = df.sort_values(by='NotDisaster freq',
                              ascending=False).reset_index()
    return corpus, df_is, df_isnot

In [50]:
corpus, df_is, df_isnot = term_freq(df)

In [62]:
df_is.head(20)

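|   | term | IsDisaster freq | NotDisaster freq | IsDisaster_Score | NotDisaster_Score |
|---|---|---|---|---|---|
| 0 | fire | 178 | 72 | 0.95 | 0.05 |
| 1 | news | 134 | 57 | 0.95 | 0.05 |
| 2 | disaster | 119 | 36 | 0.96 | 0.04 |
| 3 | california | 115 | 6 | 0.99 | 0.01 |
| 4 | suicide | 112 | 7 | 0.99 | 0.01 |
| 5 | police | 109 | 33 | 0.96 | 0.04 |
| 6 | people | 105 | 92 | 0.9 | 0.1 |
| 7 | killed | 95 | 4 | 1.0 | 0.0 |
| 8 | hiroshima | 90 | 1 | 1.0 | 0.0 |
| 9 | storm | 89 | 32 | 0.96 | 0.04 |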


In [61]:
df_isnot.head(20)

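|   | term | IsDisaster freq | NotDisaster freq | IsDisaster_Score | NotDisaster_Score |
|---|---|---|---|---|---|
| 0 | new | 56 | 168 | 0.07 | 0.93 |
| 1 | body | 15 | 115 | 0.03 | 0.97 |
| 2 | video | 69 | 96 | 0.84 | 0.16 |
| 3 | people | 105 | 92 | 0.9 | 0.1 |
| 4 | love | 11 | 90 | 0.02 | 0.98 |
| 5 | time | 31 | 84 | 0.07 | 0.93 |
| 6 | emergency | 77 | 81 | 0.88 | 0.12 |
| 7 | fire | 178 | 72 | 0.95 | 0.05 |
| 8 | let | 13 | 70 | 0.04 | 0.96 |
| 9 | good | 20 | 67 | 0.06 | 0.94 |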
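
Both tables are sorted by raw frequency, so terms that are common in both classes, such as "people" and "fire", show up in each list. One way to surface terms that are distinctive for a class is to filter on the score column as well. The sketch below does this in plain Python over the ten rows copied from the output above; the 0.9 cutoff is an arbitrary assumption chosen for illustration.

```python
# Rows copied from the df_isnot output above:
# (term, IsDisaster freq, NotDisaster freq, IsDisaster_Score, NotDisaster_Score)
rows = [
    ("new", 56, 168, 0.07, 0.93),
    ("body", 15, 115, 0.03, 0.97),
    ("video", 69, 96, 0.84, 0.16),
    ("people", 105, 92, 0.90, 0.10),
    ("love", 11, 90, 0.02, 0.98),
    ("time", 31, 84, 0.07, 0.93),
    ("emergency", 77, 81, 0.88, 0.12),
    ("fire", 178, 72, 0.95, 0.05),
    ("let", 13, 70, 0.04, 0.96),
    ("good", 20, 67, 0.06, 0.94),
]

# Assumed cutoff: keep terms whose NotDisaster score is at least 0.9
SCORE_CUTOFF = 0.9

# Rows are already in NotDisaster-frequency order, so the filter preserves it
distinctive_not = [term for term, _, _, _, not_score in rows
                   if not_score >= SCORE_CUTOFF]
print(distinctive_not)
```

On the full frame the same filter would be `df_isnot[df_isnot['NotDisaster_Score'] >= 0.9]`.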


In [53]:
html = scattertext.produce_scattertext_explorer(
    corpus,
    category='IsDisaster',
    category_name='IsDisaster',
    not_category_name='NotDisaster',
    width_in_pixels=1000
)

In [54]:
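html_file_name = "Disaster-Tweets-Scattertext.html"
# Use a context manager so the file is closed once the write finishes
with open(html_file_name, 'wb') as f:
    n_bytes = f.write(html.encode('utf-8'))
n_bytes  # number of bytes written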

1688883

## Next Steps

- Remove digits, and possibly apply some further text cleaning, before the corpus is built and the bi-grams are created
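
A minimal sketch of the digit-removal step, assuming a simple regex pass is enough. `strip_digits` is a hypothetical helper, not part of Scattertext or spaCy, and the sample sentence is made up for illustration.

```python
import re


def strip_digits(text):
    """Remove runs of digits, then collapse the leftover whitespace.

    Hypothetical helper for illustration; punctuation that sat next to
    the digits (e.g. the comma in "13,000") is left in place.
    """
    no_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", no_digits).strip()


# Made-up sample text, not taken from the dataset
sample = "Flood warning: 3 roads closed since 2015"
cleaned = strip_digits(sample)
print(cleaned)
```

With the tweets frame this could be applied as `df['text'] = df['text'].apply(strip_digits)` before `term_freq(df)` is called, so the digits never reach the corpus.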