# Text Analysis, Natural Language Processing, and Social Media Data in Python

<a href="http://creativecommons.org/licenses/by-nc/4.0/" rel="license"><img style="border-width: 0;" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" alt="Creative Commons License" /></a>
This tutorial is licensed under a <a href="http://creativecommons.org/licenses/by-nc/4.0/" rel="license">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

# Load Data As Pandas DataFrame

In [None]:
# import pandas
import pandas as pd

# load data as pandas dataframe from file
irads = pd.read_csv("index_irads.csv")

irads

In [None]:
# import pandas
import pandas as pd

# load data as pandas dataframe from URL
irads = pd.read_csv("https://raw.githubusercontent.com/kwaldenphd/social-media-targeted-advertising/main/index_irads.csv")

irads

# Exploratory Data Analysis in Pandas

In [None]:
# show data types
irads.dtypes

In [None]:
# show dataframe info
irads.info()

In [None]:
# show summary statistics
irads.describe()

In [None]:
# sort ads by cost (descending)

irads.sort_values(by=["cost"], ascending=False).head()

In [None]:
# sort by number of impressions, descending

irads.sort_values(by="impressions", ascending=False).head()

In [None]:
# sort by number of clicks, descending

irads.sort_values(by="clicks", ascending=False).head()

In [None]:
# sort by number of clicks and cost (descending)

irads.sort_values(by=['clicks', 'cost'], ascending=False).head()

In [None]:
# show the first five titles in each exclusion category

irads[['title', 'exclude']].groupby('exclude').head()

In [None]:
# number of ads by exclusion category

irads['exclude'].value_counts()

In [None]:
# number of ads by age range

irads['age'].value_counts()

For more options on interacting with data in a `DataFrame`: https://github.com/kwaldenphd/pandas-machine-learning-intro#interacting-with-a-dataframe
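
As one possible next step, `groupby` can be paired with an aggregation to summarize numeric columns per category. The sketch below uses a small invented `sample` table rather than the real dataset: the column names `age`, `clicks`, and `cost` follow those used above, but every value is made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; column names mirror the dataset above,
# but the values are invented for illustration.
sample = pd.DataFrame({
    "age": ["18-65+", "18-65+", "13-44", "13-44", "18-35"],
    "clicks": [10, 30, 5, 15, 40],
    "cost": [20.0, 60.0, 10.0, 30.0, 80.0],
})

# For each age range: number of ads, total clicks, and total cost
totals = sample.groupby("age").agg(
    ads=("clicks", "size"),
    total_clicks=("clicks", "sum"),
    total_cost=("cost", "sum"),
)

# Sort the summary so the age range with the most clicks comes first
print(totals.sort_values(by="total_clicks", ascending=False))
```

The same pattern should work on `irads` itself by swapping `sample` for `irads`, assuming the `clicks` and `cost` columns are numeric.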

Other datasets you could explore:
- [`interests.csv`](https://raw.githubusercontent.com/umd-mith/irads/master/analysis/interests.csv) (money, clicks, impressions, number of ads, interest)
- [`people_who_match.csv`](https://raw.githubusercontent.com/umd-mith/irads/master/analysis/people_who_match.csv) (money, clicks, impressions, number of ads, people type who match)

# Visualizing This Data

In [None]:
# import pandas
import pandas as pd

# load data

interests = pd.read_csv("https://raw.githubusercontent.com/umd-mith/irads/master/analysis/interests.csv")

interests

In [None]:
# quick visual check of the data using pandas built-in plotting function

# import matplotlib
import matplotlib.pyplot as plt

# generate plot
interests.plot()

# show plot
plt.show()

In [None]:
# customize plot showing number of ads by cost

interests.plot.scatter(x='money (RUB)', y='ads', alpha=0.5)

In [None]:
# create new data frame for interests with over 100000 clicks

top_interests = interests.loc[interests['clicks']>100000]

top_interests
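
The filter above relies on a boolean mask: the comparison produces one `True`/`False` value per row, and `.loc` keeps the rows marked `True`. The sketch below illustrates this on an invented `sample` table (column names follow `interests.csv`; the values are hypothetical) and also shows `nlargest` as an alternative when a fixed number of rows is wanted instead of a threshold.

```python
import pandas as pd

# Hypothetical rows; column names follow interests.csv, values are invented
sample = pd.DataFrame({
    "interest": ["music", "sports", "history", "cooking"],
    "clicks": [150000, 90000, 250000, 1000],
})

# Boolean mask: one True/False value per row
mask = sample["clicks"] > 100000

# .loc keeps only the rows where the mask is True
over_threshold = sample.loc[mask]

# Alternative: keep the two rows with the most clicks, whatever their values
top_two = sample.nlargest(2, "clicks")

print(over_threshold)
print(top_two)
```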

In [None]:
# plot number of impressions by interest
top_interests.plot.barh(x='interest', y='impressions', alpha=0.5)

For more on plotting data in a `DataFrame`: https://github.com/kwaldenphd/more-with-matplotlib

# From Data Frame to Text File

In [None]:
# test for not null values

irads[irads["title"].notna()]

In [None]:
# keep rows with a non-null description, then select that column as a Series
text = irads[irads["description"].notna()]

text = text["description"]

text.head()

In [None]:
# write ad text column to txt file (header=False keeps the column name out of the text)
text.to_csv('irads_text.txt', index=False, header=False)
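
Because `to_csv` writes CSV, it wraps any description containing a comma, quotation mark, or line break in extra quotation marks, and those marks end up in the text file. If that matters for the analysis, one alternative is to write the lines directly. This is a minimal sketch under assumptions: `sample_text` is an invented stand-in for the `text` Series above, and the output path is a temporary location chosen only for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical ad descriptions standing in for the `text` Series above
sample_text = pd.Series(
    ["Join us, share, and like!", 'He said "enough"', "Plain\ntext"],
    name="description",
)

# Temporary output location used only for this example
out_path = Path(tempfile.mkdtemp()) / "sample_text.txt"

# Write one description per line, with no header row and no CSV quoting
with open(out_path, "w", encoding="utf-8") as f:
    for line in sample_text:
        # Collapse runs of whitespace (including line breaks) to single spaces
        f.write(" ".join(str(line).split()) + "\n")

print(out_path.read_text(encoding="utf-8"))
```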

# Text Analysis Using Voyant Tools

<a href="http://voyant-tools.org/">Voyant Tools</a> is an open-source web application developed by Stéfan Sinclair and Geoffrey Rockwell in 2003, with later contributions from Andrew MacDonald, Cyril Briquet, Lisa Goddard, and Mark Turcato. Voyant is one of the leading web-based text analysis interfaces, and it grew out of earlier text analysis tools like HyperPo, Taporware, and TACT. Its <a href="https://github.com/sgsinclair/Voyant">open-source code</a> can also be used to deploy the program on your own server. Voyant users can upload text files from their computer, link to online text sources, or scrape the text from a webpage for analysis and visualization. Unlike programming-oriented text analysis environments like R and RStudio, Voyant gives users access to a range of statistical analysis and visualization features without requiring significant technical knowledge.

Download the newly-created text file.

Open a web browser (preferably Firefox or Chrome) and navigate to the <a href="http://voyant-tools.org/">Voyant Tools homepage</a>.

<p align="center"><a href="https://github.com/kwaldenphd/Voyant-tutorial/blob/master/screenshots/Capture_1.PNG?raw=true"><img class="aligncenter size-large wp-image-549" src="https://github.com/kwaldenphd/Voyant-tutorial/blob/master/screenshots/Capture_1.PNG?raw=true" alt="" width="676" height="523" /></a></p>

Upload the file and click Reveal.

<p align="center"><a href="https://github.com/kwaldenphd/Voyant-tutorial/blob/master/screenshots/Capture_2.PNG?raw=true"><img class="aligncenter size-large wp-image-550" src="https://github.com/kwaldenphd/Voyant-tutorial/blob/master/screenshots/Capture_2.PNG?raw=true" alt="" width="676" height="355" /></a></p>

Once a text or corpus has been uploaded, Voyant moves into its ‘default skin,’ or primary editing environment.

For more on Voyant's interface and functionality: https://github.com/kwaldenphd/Voyant-tutorial/tree/SPN-285#editing-in-voyant

# Natural Language Processing Using NLTK

"The Natural Language Toolkit (NLTK) is a collection of reusable Python tools (also known as a Python library) that help researchers apply a set of computational methods to texts. The tools range from methods of breaking up text into smaller pieces, to identifying whether a word belongs in a given language, to sample texts that researchers can use for training and development purposes (such as the complete text of Moby Dick)." [Zoë Wilkinson Saldaña, "Sentiment Analysis for Exploratory Data Analysis," The Programming Historian 7 (2018), https://doi.org/10.46430/phen0079.]

For more info on NLTK: https://www.nltk.org/

## Loading NLTK

In [None]:
# load nltk
import sys 
!{sys.executable} -m pip install --user -U nltk

In [None]:
# import nltk
import nltk
nltk.download('punkt')
# newer NLTK releases load tokenizer data from 'punkt_tab' instead
nltk.download('punkt_tab')

In [None]:
# import additional nltk components
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from nltk.tag import pos_tag

## Different Options for Tokenizing Using NLTK

In Natural Language Processing (NLP), "tokenizing" refers to the process of breaking a large text into smaller units (words or sentences) known as tokens.

NLTK includes a wide range of options for tokenizing text data.
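
To see why the choice of tokenizer matters, it can help to compare a few splitting rules on one sentence. The sketch below uses only Python's built-in `re` module to approximate two of the regex-based tokenizers used in this section; it is an illustration of the idea rather than NLTK's own code, and the `sample` sentence is invented.

```python
import re

# Invented sample sentence
sample = "Don't miss out! Follow us at example.com :)"

# Split on whitespace only: punctuation stays attached to words
whitespace_tokens = sample.split()

# Runs of word characters OR runs of punctuation
# (the pattern NLTK documents for WordPunctTokenizer)
wordpunct_tokens = re.findall(r"\w+|[^\w\s]+", sample)

# Word characters and apostrophes only: other punctuation is discarded
# (the same pattern passed to RegexpTokenizer in this section)
regex_tokens = re.findall(r"[\w']+", sample)

print(whitespace_tokens)
print(wordpunct_tokens)
print(regex_tokens)
```

Note how "Don't" stays whole under the apostrophe-aware pattern but is split into three tokens by the word/punctuation pattern.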

In [None]:
# tokenize using word_tokenize
from nltk.tokenize import word_tokenize

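# join every description into one string
# (str(text) would return only the truncated preview of the Series)
text_str = " ".join(text.astype(str))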

# text_str
    
word_tokenize(text_str)

In [None]:
# tokenize using TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer

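# join every description into one string
# (str(text) would return only the truncated preview of the Series)
text_str = " ".join(text.astype(str))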

# text_str
    
tokenizer = TreebankWordTokenizer()

tokenizer.tokenize(text_str)

In [None]:
# tokenize using WordPunctTokenizer 
from nltk.tokenize import WordPunctTokenizer

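# join every description into one string
# (str(text) would return only the truncated preview of the Series)
text_str = " ".join(text.astype(str))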

# text_str
    
tokenizer = WordPunctTokenizer()

tokenizer.tokenize(text_str)

In [None]:
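# tokenize using RegexpTokenizer
from nltk.tokenize import RegexpTokenizer

# join every description into one string
# (str(text) would return only the truncated preview of the Series)
text_str = " ".join(text.astype(str))

# text_str

# raw string so that \w reaches the regex engine unchanged
tokenizer = RegexpTokenizer(r"[\w']+")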

regex_words = tokenizer.tokenize(text_str)

regex_words

We can combine `word_tokenize` with a few other data wrangling steps to output a cleaned list of words.

In [None]:
# tokenize using word_tokenize
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text_str)

# convert to lower case
tokens = [w.lower() for w in tokens]

# remove punctuation/special characters
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# remove non-text content
words = [word for word in stripped if word.isalpha()]

# filter out stop words
import nltk
from nltk.corpus import stopwords

# download the stop word list (needed the first time this runs)
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

words = [w for w in words if w not in stop_words]

# optional: keep only words longer than 3 characters
# words = [word for word in words if len(word) > 3]

# output cleaned list of words
print(words)
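
The cleaned list keeps repeated words, which the term-frequency counts in the next step rely on. If a list of unique words is wanted instead, duplicates can be dropped with built-in Python tools. A minimal sketch, using an invented `sample_words` list as a stand-in for `words`:

```python
# Hypothetical cleaned tokens standing in for the `words` list above
sample_words = ["join", "community", "join", "music", "community", "join"]

# set() drops duplicates; sorted() returns them in alphabetical order
unique_words = sorted(set(sample_words))

# dict.fromkeys() drops duplicates but keeps first-seen order
unique_in_order = list(dict.fromkeys(sample_words))

print(unique_words)
print(unique_in_order)
```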

We can then take that list of words and plot term frequency and distribution.

In [None]:
# import FreqDist from nltk
from nltk.probability import FreqDist

# analyze term frequency/distribution
data_analysis = FreqDist(words)

data_analysis

In [None]:
# plot term frequency/distribution for all terms
data_analysis.plot()

In [None]:
# show term frequency/distribution for top 10 terms
for word, frequency in data_analysis.most_common(10):
    print(f'{word};{frequency}')
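
NLTK's `FreqDist` builds on Python's `collections.Counter`, so the counting logic can be tried out without any corpus at all. The sketch below uses an invented `sample_words` list and adds each term's share of all tokens, which is one way to compare frequencies across texts of different lengths.

```python
from collections import Counter

# Hypothetical cleaned tokens standing in for the `words` list above
sample_words = ["join", "community", "join", "music", "community", "join"]

# Count how many times each word appears
counts = Counter(sample_words)

# Total number of tokens, used to turn counts into proportions
total = sum(counts.values())

# Print word, raw count, and share of all tokens for the top two terms
for word, frequency in counts.most_common(2):
    print(f"{word};{frequency};{frequency / total:.2f}")
```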

# Putting it all together

Discussion questions:
- What kinds of things were you interested in exploring via this dataset?
- How did you approach those questions using computational methods?
  * This could focus on what you did in Python using `pandas` and/or `nltk`
  * You could also think about insights gained from a graphical user interface program like Voyant Tools
- What kinds of insights were you able to gain?
- How did interacting with this data using computational methods shape your understanding of the data?
- Where would you go next?
- How are you thinking about race and surveillance after engaging with this data?
- Other questions/thoughts/observations

# Lab Notebook Questions

The lab notebook consists of a narrative that documents and describes your experience working through this lab.

You can respond to/engage with other discussion questions included in the lab procedure.

Specific questions for the lab notebook (from the "Putting It All Together" section):
- What kinds of things were you interested in exploring via this dataset?
- How did you approach those questions using computational methods?
  * This could focus on what you did in Python using `pandas` and/or `nltk`
  * You could also think about insights gained from a graphical user interface program like Voyant Tools
- What kinds of insights were you able to gain?
- How did interacting with this data using computational methods shape your understanding of the data?
- Where would you go next?
- How are you thinking about race and surveillance after engaging with this data?
- Other questions/thoughts/observations

I encourage folks to include code + screenshots as part of that narrative.