# Simple Text Analytics with Pandas


For this exercise we use data from the NYC open data project:

https://opendata.cityofnewyork.us

In particular, we use a dataset containing information about OATH Hearings Division Case Status from 2016. The following link provides the up-to-date data together with a description of each variable:

https://data.ny.gov/City-Government/OATH-Hearings-Division-Case-Status/jz4z-kudi


For this notebook we will use these variables:

- **ticketnumber:** Bill id
- **violationdate:** Date of the bill
- **issuingagency:** Who is issuing the bill
- **respondentlastname:** Who is the respondent of the bill
- **charge1codedescription:** Bill text description
- **charge1infractionamount:** Amount


Our objective is to create a word cloud, which represents word importance visually. See https://en.wikipedia.org/wiki/Tag_cloud for more details.

<img src="https://upload.wikimedia.org/wikipedia/commons/9/9e/Foundation-l_word_cloud_without_headers_and_quotes.png" alt="drawing" width="600"/>


First, we need to install a Python library that generates this representation easily.

In [None]:
pip install wordcloud

## Data loading, column selection and cleaning

We begin by loading the data

In [None]:
import pandas as pd

# Path to the CSV file with the 2016 bills
filename = "unpaid_bills_nyc_2016.csv"
data_raw = pd.read_csv(filename)

We select the columns of interest

In [None]:
cols = ["ticketnumber", "violationdate", "issuingagency", "respondentlastname",
        "charge1codedescription",  "charge1infractionamount"]
data = data_raw[cols]

We drop rows with null values and keep only the infractions with an amount above $0.

In [None]:
data = data.dropna()
data = data[data["charge1infractionamount"]>0]
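
The two cleaning steps above can be tried on a small example first. The following is a minimal sketch on a toy DataFrame whose values are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy data with hypothetical values, only to illustrate the two cleaning steps
toy = pd.DataFrame({
    "ticketnumber": ["A1", "A2", "A3", "A4"],
    "charge1infractionamount": [100.0, 0.0, None, 250.0],
})

# Step 1: drop rows that have a missing value in any column
toy_clean = toy.dropna()

# Step 2: keep only rows whose amount is strictly positive
toy_clean = toy_clean[toy_clean["charge1infractionamount"] > 0]

print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Note that `dropna()` with no arguments removes a row if any of its columns is missing, so it is worth selecting the columns of interest before calling it.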

## Let's inspect the data

In [None]:
data.shape

In [None]:
data

## Create a word cloud from a DataFrame column

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def show_wordcloud(data, title = None):
    
    # Split every description into individual lowercase words
    words = [word.lower() for text in data for word in str(text).split()]

    stoplist = set('for a of the and to in from than with'.split())
    
    # Drop stop words and words of one or two characters
    words_clean = [word for word in words if word not in stoplist]
    words_clean = [word for word in words_clean if len(word)>2]
    wordcloud = WordCloud(
        background_color='white',
        max_words=100,
        max_font_size=40, 
        scale=3,
        random_state=42
    ).generate(' '.join(words_clean))  # build the cloud from the cleaned words

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()
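
A word cloud draws frequent words larger, so it can help to look at the underlying word counts directly. The sketch below is one possible way to do that with the standard library, using the same stop list and length filter as the function above; the descriptions are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical descriptions, standing in for the real column values
descriptions = [
    "FAILURE TO CLEAN SIDEWALK",
    "DIRTY SIDEWALK",
    "FAILURE TO POST PERMIT",
]

stoplist = set('for a of the and to in from than with'.split())

# Lowercase every word, then drop stop words and very short words
words = [w.lower() for text in descriptions for w in text.split()]
words = [w for w in words if w not in stoplist and len(w) > 2]

# Count how often each remaining word appears
word_counts = Counter(words)
print(word_counts.most_common(2))
```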

## Let's use it with the bill description column

In [None]:
show_wordcloud(data["charge1codedescription"])

## Basic descriptive statistics

In [None]:
total_bills = data["charge1infractionamount"].sum()
num_bills = data["charge1infractionamount"].count()
avg_bills = data["charge1infractionamount"].mean()

f'There are {num_bills} bills, with a total sum equal to {total_bills:,.2f} and an average amount of {avg_bills:,.2f}'


## Let's find someone famous

For example ...
<img src="https://upload.wikimedia.org/wikipedia/commons/5/56/Donald_Trump_official_portrait.jpg" alt="drawing" width="200"/>

Find all Donald Trump bills

In [None]:
trump_bills = <FILL IN>
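
As a hint, rows can be selected by matching text in a column. The sketch below shows one possible approach on a toy DataFrame with made-up respondents; how names are actually written in `respondentlastname` is something to check in the real data, so the search term and the matching strategy here are assumptions.

```python
import pandas as pd

# Toy data with made-up respondents; the real column values may look different
toy = pd.DataFrame({
    "respondentlastname": ["ACME CORP", "Smith", "acme holdings llc", "JONES"],
    "charge1infractionamount": [100.0, 50.0, 300.0, 75.0],
})

# Case-insensitive substring match; na=False treats missing names as non-matches
mask = toy["respondentlastname"].str.contains("acme", case=False, na=False)
acme_bills = toy[mask]

print(acme_bills)
```

A substring match can also return unrelated respondents that happen to contain the search term, so it is a good idea to inspect the matched rows before computing statistics on them.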

In [None]:
trump_bills.shape

In [None]:
trump_total_bills = trump_bills["charge1infractionamount"].sum()
trump_num_bills = trump_bills["charge1infractionamount"].count()
trump_avg_bills = trump_bills["charge1infractionamount"].mean()

f'D. Trump has {trump_num_bills} bills, with a total sum equal to {trump_total_bills:,.2f} and an average amount of {trump_avg_bills:,.2f}'



## Let's see the ones with the largest amounts

In [None]:
<FILL IN>
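
As a hint, pandas offers more than one way to rank rows by a numeric column. The following sketch uses invented values and shows two options that should give the same rows when there are no ties.

```python
import pandas as pd

# Toy data with hypothetical ticket numbers and amounts
toy = pd.DataFrame({
    "ticketnumber": ["A1", "A2", "A3", "A4"],
    "charge1infractionamount": [100.0, 50.0, 300.0, 75.0],
})

# Option 1: the two rows with the largest amounts, highest first
top2 = toy.nlargest(2, "charge1infractionamount")

# Option 2: sort in descending order and keep the first two rows
top2_sorted = toy.sort_values("charge1infractionamount", ascending=False).head(2)

print(top2)
```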

## Use loc to inspect bill with id 9175

In [None]:
<FILL IN>

In [None]:
<FILL IN>["charge1codedescription"]
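
As a hint, `loc` selects rows by index label rather than by position. The sketch below uses a toy DataFrame with hypothetical index labels and invented values to show both a full-row lookup and a single-cell lookup.

```python
import pandas as pd

# Toy data; the index labels 10 and 42 are hypothetical
toy = pd.DataFrame(
    {
        "charge1codedescription": ["DIRTY SIDEWALK", "FAILURE TO POST PERMIT"],
        "charge1infractionamount": [100.0, 300.0],
    },
    index=[10, 42],
)

# Select the whole row with index label 42 (returned as a Series)
row = toy.loc[42]

# Select a single cell by row label and column name
description = toy.loc[42, "charge1codedescription"]

print(row)
print(description)
```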

## Print the complete description of all D. Trump's bills

In [None]:
for s in trump_bills["charge1codedescription"]:
    print(s)

## Create the word cloud

In [None]:
<FILL IN>