# NLP: pandas basics

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import pandas as pd

### Better display format

In [None]:
import nltk
nltk.download('inaugural')

from nltk.corpus import inaugural

In [None]:
files = inaugural.fileids()
texts = [inaugural.raw(file) for file in files]

In [None]:
print(len(files))
files[-10:] # print last few files

In [None]:
inaugural.raw(files[-1])

In [None]:
# ignore the last 4 characters ".txt"
years = [file[:-4].split("-")[0] for file in files]
presidents = [file[:-4].split("-")[1] for file in files]
print(years[-10:]) # print the last few years
print(presidents[-10:]) # print the last few presidents

In [None]:
df = pd.DataFrame({
    "year": years,
    "president": presidents,
    "file": files,
    "text": texts
})
df.set_index("year", inplace=True)
df.tail() # print last few files

In [None]:
df["length"] = df.text.str.split().str.len()
df.tail() # print last few files
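
The word count above chains two `.str` operations: `str.split()` turns each text into a list of words, and `str.len()` takes the length of each list. A minimal sketch on two made-up sentences (not taken from the corpus) shows the intermediate steps:

```python
import pandas as pd

# Two short made-up texts, standing in for the inaugural addresses.
texts = pd.Series(["We the people",
                   "Ask not what your country can do for you"])

words = texts.str.split()   # each entry becomes a list of words
lengths = words.str.len()   # length of each list, i.e. the word count

print(lengths.tolist())
```

Splitting on whitespace is only a rough word count, since punctuation stays attached to the neighbouring word.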

In [None]:
df.length.plot(hover_data={'president': df.president}, backend='plotly')

In [None]:
i = df.length.argmax()
print(df.iloc[i])

### DataFrame

The pandas package can be viewed as a more powerful Excel.  It gives up the graphical user interface, but it is much more flexible and efficient, which is a fair trade.  pandas uses `Series` for list-like data and `DataFrame` for tabular data.

A `Series` is a list with an index.

In [None]:
years = list(range(1911, 2030))
y2z = ["rat", "ox", "tiger", "rabbit", "dragon", "snake", "horse", "goat", "monkey", "rooster", "dog", "pig"]
zodiac = [y2z[(y - 1912) % 12] for y in years]
Z = pd.Series(zodiac, index=years)
Z
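
Because a `Series` carries an index, an entry can be looked up either by its label or by its position. The sketch below rebuilds a shorter version of the zodiac series (the year range 2020 to 2025 is chosen only for illustration) and uses `.loc` for labels and `.iloc` for positions:

```python
import pandas as pd

animals = ["rat", "ox", "tiger", "rabbit", "dragon", "snake",
           "horse", "goat", "monkey", "rooster", "dog", "pig"]
years = list(range(2020, 2026))
Z = pd.Series([animals[(y - 1912) % 12] for y in years], index=years)

by_label = Z.loc[2024]     # look up by index label (the year)
by_position = Z.iloc[0]    # look up by position (the first entry)

print(by_label, by_position)
```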

A `DataFrame` is 

- a dictionary of `Series` (columns), or 
- a list of lists (rows).

Recall the example.  

| student \ subject | A | B | C | D | E | decision | comments |
|----|----|----|----|----|----|----|----|
| 1 | 10 | 10 | 10 | 10 | 10 | accept | good |
| 2 | 10 | 10 | 10 | 10 | 0 | accept | so so |
| 3 | 0 | 0 | 15 | 0 | 0 | decline | need improvement |

In [None]:
cht = [10, 10, 0]
eng = [10, 10, 0]
math = [10, 10, 15]
nsci = [10, 10, 0]
ssci = [10, 0, 0]
df = pd.DataFrame({
    "Chinese": cht, 
    "English": eng, 
    "Math": math, 
    "N. Science": nsci, 
    "S. Science": ssci
})
df

In [None]:
arr = [[10, 10, 10, 10, 10], 
       [10, 10, 10, 10, 0], 
       [0, 0, 15, 0, 0]]
df = pd.DataFrame(arr)
df
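
The column-wise and row-wise constructions describe the same table. As a quick check, the sketch below builds the scores from the table both ways and compares them with `DataFrame.equals`; passing `columns=` to the row-wise version is an addition here, so that both frames carry the same column names.

```python
import pandas as pd

subjects = ["Chinese", "English", "Math", "N. Science", "S. Science"]

# Column-wise: a dictionary mapping each column name to its values.
by_columns = pd.DataFrame({
    "Chinese": [10, 10, 0],
    "English": [10, 10, 0],
    "Math": [10, 10, 15],
    "N. Science": [10, 10, 0],
    "S. Science": [10, 0, 0],
})

# Row-wise: one inner list per student, column names given separately.
by_rows = pd.DataFrame(
    [[10, 10, 10, 10, 10],
     [10, 10, 10, 10, 0],
     [0, 0, 15, 0, 0]],
    columns=subjects,
)

same = by_columns.equals(by_rows)
print(same)
```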

### Index and columns

The names of the rows are stored in `df.index`, while the names of the columns are stored in `df.columns`.

In [None]:
arr = [[10, 10, 10, 10, 10], 
       [10, 10, 10, 10, 0], 
       [0, 0, 15, 0, 0]]
df = pd.DataFrame(arr)
df.index = ["Amy", "Bill", "Charles"]
df.columns =  ["Chinese", "English", "Math", "N. Science", "S. Science"]
df

Adding a new column is easy.

In [None]:
df["decision"] = ["accept", "accept", "decline"]
df["comments"] = ["good", "so so", "need improvement"]
df

### Selection and slicing

In [None]:
arr = [[10, 10, 10, 10, 10], 
       [10, 10, 10, 10, 0], 
       [0, 0, 15, 0, 0]]
df = pd.DataFrame(arr)
df.index = ["Amy", "Bill", "Charles"]
df.columns =  ["Chinese", "English", "Math", "N. Science", "S. Science"]
df["decision"] = ["accept", "accept", "decline"]
df["comments"] = ["good", "so so", "need improvement"]
df

Each row and column has both a numerical position and a name.  Use `df.iloc` to select by position and `df.loc` to select by name.

In [None]:
df.loc['Bill']

In [None]:
df.iloc[1]
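
Both `loc` and `iloc` also accept a row and a column at the same time, which selects a single cell. A minimal sketch on a smaller table (three of the subjects, chosen only to keep the example short):

```python
import pandas as pd

df = pd.DataFrame(
    [[10, 10, 10], [10, 10, 10], [0, 0, 15]],
    index=["Amy", "Bill", "Charles"],
    columns=["Chinese", "English", "Math"],
)

by_name = df.loc["Charles", "Math"]   # row and column given by name
by_number = df.iloc[2, 2]             # the same cell given by position

print(by_name, by_number)
```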

Slicing works in pandas as it does in NumPy.  In addition, pandas lets you slice by names; a slice by name includes both endpoints.

In [None]:
df.loc[:, "Chinese":"S. Science"]
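
One difference from NumPy is worth seeing in action: a slice by name with `loc` keeps the last name, whereas a slice by position with `iloc` stops before the last position. A small sketch, again on a reduced table:

```python
import pandas as pd

df = pd.DataFrame(
    [[10, 10, 10], [10, 10, 10], [0, 0, 15]],
    index=["Amy", "Bill", "Charles"],
    columns=["Chinese", "English", "Math"],
)

# Label-based slice: both endpoints are included.
label_slice = df.loc[:, "Chinese":"English"]

# Position-based slice: the stop position is excluded, as in NumPy.
position_slice = df.iloc[:, 0:1]

print(list(label_slice.columns))
print(list(position_slice.columns))
```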

You could get a column with `df.iloc[:, i]`, but selecting it by name is easier.

In [None]:
df["decision"]

In [None]:
df.decision
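
The two forms return the same column, but the attribute form only works when the column name is a valid Python identifier. A name such as `N. Science`, which contains a dot and a space, needs the bracket form. A brief sketch on a two-row table made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "decision": ["accept", "decline"],
    "N. Science": [10, 0],
})

# Bracket access and attribute access give the same Series.
same = df["decision"].equals(df.decision)

# "N. Science" is not a valid attribute name, so brackets are required.
nsci = df["N. Science"]

print(same, nsci.tolist())
```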

### Groupby and apply

In [None]:
arr = [[10, 10, 10, 10, 10], 
       [10, 10, 10, 10, 0], 
       [0, 0, 15, 0, 0]]
df = pd.DataFrame(arr)
df.index = ["Amy", "Bill", "Charles"]
df.columns =  ["Chinese", "English", "Math", "N. Science", "S. Science"]
df["decision"] = ["accept", "accept", "decline"]
df["comments"] = ["good", "so so", "need improvement"]
df

There are several ways to manipulate the data to extract new features.  

In [None]:
df["total"] = df.loc[:,"Chinese":"S. Science"].sum(axis=1)
df

In [None]:
df["w. total"] = df["Chinese"] + 2*df["English"] + 2*df["Math"] + 2*df["N. Science"]
df

If there is no appropriate built-in function for your purpose, or the computation is too complicated for one, you may use `apply` to apply your own function to the data.

In [None]:
df["pre-decision"] = df["w. total"].apply(lambda k: "accept" if k >= 60 else "decline")
df
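
`apply` also works on a whole `DataFrame`: with `axis=1`, the function receives one row at a time as a `Series`. The sketch below uses a hypothetical helper, `best_subject`, and made-up scores to pick each student's strongest subject.

```python
import pandas as pd

scores = pd.DataFrame(
    {"Chinese": [10, 5, 0], "Math": [8, 10, 15]},
    index=["Amy", "Bill", "Charles"],
)

def best_subject(row):
    """Return the name of the column with the highest score in this row."""
    return row.idxmax()

# axis=1 passes each row (one student) to the function.
best = scores.apply(best_subject, axis=1)

print(best.tolist())
```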

Lastly, you may group the data by the values of some column and get collective information.  

    groupby = split + apply + combine

In [None]:
df.loc[:,"Chinese":"decision"].groupby("decision").mean()
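
To make the "split + apply + combine" slogan concrete, the sketch below carries out the three steps by hand on a single column and compares the result with the built-in call. The toy table holds only the Math scores and decisions from the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Math": [10, 10, 15],
    "decision": ["accept", "accept", "decline"],
})

# split: one sub-table per distinct value of "decision"
pieces = {key: group for key, group in df.groupby("decision")}

# apply: compute the mean Math score of each piece
means = {key: group["Math"].mean() for key, group in pieces.items()}

# combine: collect the results into one Series indexed by the group key
manual = pd.Series(means, name="Math")

builtin = df.groupby("decision")["Math"].mean()

print(manual.to_dict())
print(builtin.to_dict())
```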

### NLP task: find themes in each century

Let's try to find the most frequent words in the inaugural addresses in each period of time.

In [None]:
import nltk
nltk.download('inaugural')

from nltk.corpus import inaugural

files = inaugural.fileids()
texts = [inaugural.raw(file) for file in files]
years = [file[:-4].split("-")[0] for file in files]
presidents = [file[:-4].split("-")[1] for file in files]
df = pd.DataFrame({
    "year": years,
    "president": presidents,
    "file": files,
    "text": texts
})
df.set_index("year", inplace=True)
df.tail() # print last few files

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer(stop_words='english')
X = cvec.fit_transform(df.text).toarray()
X.shape

In [None]:
keywords_indices = X.argpartition(-5, axis=1)[:,-5:]
keywords_indices[-5:,:] # print last few files
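
`argpartition` returns, for each row, the positions of the `k` largest counts, but it does not promise any order among those `k` positions. The sketch below shows this on a small made-up count matrix and then, as one possible follow-up, sorts the selected positions by count; the follow-up step is an addition for illustration.

```python
import numpy as np

# A made-up count matrix: 2 documents, 5 words.
counts = np.array([[3, 0, 7, 1, 5],
                   [2, 9, 0, 4, 1]])
k = 2

# Positions of the k largest entries in each row, in no particular order.
top_k = np.argpartition(counts, -k, axis=1)[:, -k:]

# Optional: reorder the selected positions from largest to smallest count.
top_counts = np.take_along_axis(counts, top_k, axis=1)
order = np.argsort(-top_counts, axis=1)
top_k_sorted = np.take_along_axis(top_k, order, axis=1)

print(top_k_sorted.tolist())
```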

In [None]:
keywords = cvec.get_feature_names_out()[keywords_indices]
keywords_list = [list(k) for k in keywords]
df["keywords"] = keywords_list
df.tail() # print last few files

In [None]:
df["decade"] = df.index.to_series().astype(int) // 10 * 10
df.tail() # print last few files

In [None]:
df.loc[:, "keywords":"decade"].groupby("decade").sum()
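
Summing a column of lists concatenates the lists within each group. To see how often each keyword occurs in a period, one possible next step is to put every keyword on its own row with `explode` and then count. The sketch below uses made-up keywords, and the column name `period` is illustrative.

```python
import pandas as pd

# Toy data; the keywords and the column name "period" are made up.
toy = pd.DataFrame({
    "period": [1860, 1860, 1950],
    "keywords": [["union", "people"], ["people", "law"], ["world", "peace"]],
})

# One row per (period, keyword) pair instead of one list per row.
exploded = toy.explode("keywords")

# Count each keyword within its period.
counts = exploded.groupby("period")["keywords"].value_counts()

print(counts)
```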

### Further reading

- [_Python Data Science Handbook_](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas