<img align="left" width="200" src="./img/northwestern.png">

# 5. Analysis

## Telling a story with data

The Named Entity Recognizer report lists more than a million entities, including names, dates, monetary amounts, and locations. With the location data, it is possible to map the places mentioned in the Northwestern Alumni News.

Pandas is a Python library for working with tabular data such as csv files. Since the Named Entity Recognizer output is already a table, Pandas is a great tool to use for this data.


First, I downloaded the entities file from the HTRC and added it to a folder in this directory. I called this folder "data" and the file "entities.csv".

In [1]:
#This code imports pandas and creates a dataframe based on the csv of entities. A dataframe is a Pandas table.
import pandas as pd
df = pd.read_csv('./data/entities.csv')
# This displays the first ten lines only
df.head(10)

Unnamed: 0,vol_id,page_seq,entity,type
0,ien.35556027642412,2,EVANSTON ILLINOIS,LOCATION
1,ien.35556027642412,7,Dental School,ORGANIZATION
2,ien.35556027642412,7,Northwestern,LOCATION
3,ien.35556027642412,7,"October 15, 1913",DATE
4,ien.35556027642412,7,University,ORGANIZATION
5,ien.35556027642412,7,College of Liberal Arts,ORGANIZATION
6,ien.35556027642412,7,University,ORGANIZATION
7,ien.35556027642412,7,Lake Forest,LOCATION
8,ien.35556027642412,7,"October 15, 1914",DATE
9,ien.35556027642412,7,University,ORGANIZATION


In [2]:
#There are so many Pandas tools. Pandas is very worthwhile to learn. This function shows the mean value for the page_seq column, for example:
df['page_seq'].mean()

164.8328783494422

Not terribly useful data right now, but it is great to explore the different functions offered by Pandas.
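
As a sketch of that kind of exploration, the example below counts how many rows fall under each entity type and how many distinct volumes appear. To keep it runnable without the full file, it builds a small dataframe from the first five rows shown above; the variable names `sample`, `type_counts`, and `n_volumes` are my own and are not part of the original notebook.

```python
import pandas as pd

# The first five rows of the entities table shown above,
# typed in by hand so the example runs without entities.csv.
sample = pd.DataFrame(
    {
        "vol_id": ["ien.35556027642412"] * 5,
        "page_seq": [2, 7, 7, 7, 7],
        "entity": [
            "EVANSTON ILLINOIS",
            "Dental School",
            "Northwestern",
            "October 15, 1913",
            "University",
        ],
        "type": ["LOCATION", "ORGANIZATION", "LOCATION", "DATE", "ORGANIZATION"],
    }
)

# How many rows are there for each entity type?
type_counts = sample["type"].value_counts()

# How many distinct volumes does the table cover?
n_volumes = sample["vol_id"].nunique()

print(type_counts)
print(n_volumes)
```

On the full dataframe, the same two calls would be `df['type'].value_counts()` and `df['vol_id'].nunique()`.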

In [3]:
# I am interested in doing something with the location data. I want to normalize the data a little before it is counted. I lowercase the entities here:
df['entity'] = df['entity'].str.lower()
df

Unnamed: 0,vol_id,page_seq,entity,type
0,ien.35556027642412,2,evanston illinois,LOCATION
1,ien.35556027642412,7,dental school,ORGANIZATION
2,ien.35556027642412,7,northwestern,LOCATION
3,ien.35556027642412,7,"october 15, 1913",DATE
4,ien.35556027642412,7,university,ORGANIZATION
...,...,...,...,...
1233372,ien.35556027642651,230,his,PERSON
1233373,ien.35556027642651,230,her,PERSON
1233374,ien.35556027642651,230,american osteo¬ pathic hospital association,ORGANIZATION
1233375,ien.35556027642651,230,1969,DATE
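
Lowercasing is only one possible normalization step. The sketch below also trims leading and trailing spaces and collapses repeated spaces, which can otherwise make the same place count as two different entities. The messy values here are made up for illustration and do not come from the entities file.

```python
import pandas as pd

# Hypothetical messy values, invented for this example.
entities = pd.Series(["  Evanston ", "EVANSTON", "New   York", "new york"])

normalized = (
    entities.str.lower()                     # same step as in the cell above
    .str.strip()                             # drop leading/trailing whitespace
    .str.replace(r"\s+", " ", regex=True)    # collapse runs of whitespace
)

print(normalized.tolist())
# ['evanston', 'evanston', 'new york', 'new york']
```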


In [4]:
# This keeps only the rows whose type is LOCATION
df2 = df[df['type'] == 'LOCATION']
# This displays only the first 100 rows of the filtered dataframe (they are not yet counted or ranked).
df2.head(100)

Unnamed: 0,vol_id,page_seq,entity,type
0,ien.35556027642412,2,evanston illinois,LOCATION
2,ien.35556027642412,7,northwestern,LOCATION
7,ien.35556027642412,7,lake forest,LOCATION
14,ien.35556027642412,7,lake michigan,LOCATION
26,ien.35556027642412,8,virginia,LOCATION
...,...,...,...,...
457,ien.35556027642412,25,evanston,LOCATION
460,ien.35556027642412,25,evanston,LOCATION
461,ien.35556027642412,25,chicago,LOCATION
465,ien.35556027642412,25,chicago,LOCATION


In [5]:
df3 = df2['entity']
#This counts the number of times each location is mentioned and sorts in descending order.
df4 = df3.value_counts().head(100)
df4

chicago      24427
evanston     14830
illinois      9794
new york      4344
michigan      3404
             ...  
dallas         355
phoenix        349
russia         347
mexico         334
san diego      331
Name: entity, Length: 100, dtype: int64
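
The filter-then-count steps can also be wrapped in one small function, which makes it easy to repeat the analysis for other entity types such as ORGANIZATION or PERSON. This is a minimal sketch, not the code used to produce the output above: `top_entities` is a hypothetical helper name, and the sample rows are invented so that the example runs on its own.

```python
import pandas as pd


def top_entities(df, entity_type, n=100):
    """Return the n most frequent entities of one type (hypothetical helper)."""
    subset = df.loc[df["type"] == entity_type, "entity"]
    # value_counts() sorts from most to least frequent by default.
    return subset.value_counts().head(n)


# Invented rows with the same columns as the entities table.
sample = pd.DataFrame(
    {
        "entity": ["chicago", "evanston", "chicago", "university", "chicago", "evanston"],
        "type": ["LOCATION", "LOCATION", "LOCATION", "ORGANIZATION", "LOCATION", "LOCATION"],
    }
)

top_locations = top_entities(sample, "LOCATION", n=2)
print(top_locations)
```

With the real dataframe, `top_entities(df, 'LOCATION')` should give the same ranking as the two cells above.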

In [None]:
#Finally, in order to work with the data in Google Sheets it needs to be a csv
df4.to_csv('./data/top-100-locations.csv')
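
Depending on the Pandas version, the csv written from a `value_counts()` result may not have clear column headers. One way to control them is to turn the counts into a two-column table first. The sketch below assumes the column names `location` and `count`, which are my choice, and uses made-up counts. It writes the csv to a string so that no file is created; passing a path such as `'./data/top-100-locations.csv'` to `to_csv` would write a file instead.

```python
import pandas as pd

# Made-up location mentions, standing in for the filtered entity column.
locations = pd.Series(["chicago", "evanston", "chicago", "new york", "chicago"])
counts = locations.value_counts()

# Name the index and the values explicitly so the csv gets clear headers.
table = counts.rename_axis("location").reset_index(name="count")

# With no path, to_csv returns the csv text instead of writing a file.
csv_text = table.to_csv(index=False)
print(csv_text)
```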

I cleaned up the names of the locations in OpenRefine (HIGHLY recommend learning more about OpenRefine!). Once I did that, I added the cleaned-up table to Google Sheets. There is a built-in mapping tool in Sheets, so all I had to do was choose Insert > Chart and pick a map. Now, the top 100 locations mentioned in the alumni magazine are mapped. The more often a location is mentioned, the larger its dot.

![locations mentioned in the collection](./img/map.png)
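
I did the cleanup in OpenRefine, but a simple version of the same idea can be sketched in Pandas: map variant spellings onto one name, then add up their counts. The mapping and the counts below are hypothetical and only illustrate the technique; they are not the cleanup rules I actually applied.

```python
import pandas as pd

# Made-up counts; "evanston illinois" and "evanston" refer to the same place.
counts = pd.DataFrame(
    {
        "location": ["chicago", "evanston", "evanston illinois", "new york"],
        "count": [5, 3, 1, 2],
    }
)

# Hypothetical mapping from variant names to the preferred name.
variants = {"evanston illinois": "evanston"}
counts["location"] = counts["location"].replace(variants)

# Rows that now share a name are merged by summing their counts.
merged = (
    counts.groupby("location", as_index=False)["count"]
    .sum()
    .sort_values("count", ascending=False, ignore_index=True)
)

print(merged)
```

OpenRefine is still the better fit when the variants are not known in advance, since its clustering tools help find them.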

## Where to go from here

### Practice using HathiTrust Research Center and other out-of-the-box text analysis tools

In addition to HTRC:
- Constellate https://constellate.org/ 
- Voyant https://voyant-tools.org/

### Build skills in R or Python

The NU statistics department has a page with resources. It includes help in deciding between R and Python.
https://statistics.northwestern.edu/undergraduate/r_python_resources.html