In [1]:
import pandas as pd

checkouts_df = pd.read_csv('seattle_checkouts.csv')
checkouts2020_df = pd.read_csv('seattle_checkouts_2020.csv')

In [None]:
checkouts2020_df

# Plotting charts

See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html for parameters you can use to plot charts. 

Summary of the following:

- Charts should have a title and axis labels. Use the parameters `title`, `xlabel`, and `ylabel` to specify these.
- You can rotate the x-axis labels with the `rot` parameter; for example, `rot=0.0` makes them horizontal.
- There are ways to select just the categories that you are interested in, e.g. `checkout_types_per_month_df[['BOOK','AUDIOBOOK', 'EBOOK']]`. You should justify why you're only looking at those categories, though. 
- Choose a plot that best matches your data and your intentions. For example, pie plots are appropriate for some data, but not for others (e.g. if you have a lot of categories, it becomes difficult to distinguish proportions in a pie plot). Line plots are good for tracking changes across time. 
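
As a rough illustration of the points above, the sketch below plots a small made-up Series (the counts are invented for the example, not taken from the Seattle data) with a title, axis labels, and horizontal tick labels. It assumes matplotlib is installed, since pandas uses it for plotting.

```python
import pandas as pd

# Hypothetical counts, invented purely for illustration
toy_counts = pd.Series({'BOOK': 120, 'EBOOK': 95, 'AUDIOBOOK': 40})

ax = toy_counts.plot(
    kind='bar',
    title='Checkouts by Material Type (toy data)',
    xlabel='Type of material',
    ylabel='Number of Checkouts',
    rot=0,  # keep the x-axis labels horizontal
)
```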

In [None]:
checkout_types = checkouts_df['MaterialType'].value_counts()
checkout_types

In [None]:
checkout_types.plot(kind="bar", figsize=(7,5))

In [None]:
# Let's eliminate the extraneous categories.
# We can select the ones we want by label with the .loc indexer:
filtered_checkout_types = checkout_types.loc[['EBOOK', 'BOOK', 'AUDIOBOOK']]
filtered_checkout_types.plot(kind="bar", figsize=(6,4), xlabel='Type of material')

In [None]:
# We can rotate the x-axis labels!
filtered_checkout_types.plot(kind="bar", figsize=(7,5), xlabel='Type of material', ylabel = 'Number of Checkouts', rot=0.0)

Plotting bar graphs with multiple categories:

**Question**

Explain what the data in the dataframe below represents.

In [None]:
checkout_types_per_month_df = checkouts2020_df.groupby(['CheckoutMonth', 'MaterialType']).size().unstack(fill_value=0)
checkout_types_per_month_df

If I do the following I get an unreadable mess:

In [None]:
checkout_types_per_month_df.plot(kind='bar')

Instead, I can isolate just the columns I want. 

In [None]:
checkout_types_per_month_df[['BOOK', 'AUDIOBOOK', 'EBOOK']].plot(
    kind='bar',
    color=['purple', 'red', 'pink'],
    title='Instances of Material Types Checked Out per Month',
    ylabel='Number of Checkouts',
)

In [None]:
# What are some advantages or disadvantages of presenting a stacked plot?
checkout_types_per_month_df[['BOOK', 'AUDIOBOOK', 'EBOOK']].plot(
    kind='bar',
    color=['purple', 'red', 'pink'],
    title='Instances of Material Types Checked Out per Month',
    stacked=True,
)

## Pie plots

Here, I'm adding an extra column called `Fiction` to the dataframe. The `Fiction` value is True if "Fiction" appears in the `Subjects` entry for the corresponding work.

In [None]:
checkouts2020_df['Fiction'] = checkouts2020_df['Subjects'].str.contains('Fiction')
checkouts2020_df
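
One thing worth checking here: `str.contains` is case-sensitive, and rows with a missing `Subjects` entry may come back as missing values rather than `False`. A minimal sketch on a made-up Series (the subject strings are invented), using the `na` parameter of `str.contains` to treat missing entries as `False`:

```python
import pandas as pd

# Hypothetical subject strings, invented for illustration
toy_subjects = pd.Series(['Fiction, Mystery', 'Nonfiction, History', None])

# Case-sensitive match: 'Nonfiction' does not contain 'Fiction' with a capital F.
# na=False makes rows with no subject count as "not fiction".
is_fiction = toy_subjects.str.contains('Fiction', na=False)
print(is_fiction.tolist())
```

Whether missing subjects should count as non-fiction is a judgment call about the data, so it is worth stating that choice when you present the chart.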

In [None]:
checkouts2020_df['Fiction'].value_counts().plot(kind="pie")

## Line plots

Here, I'm tracking the number of Fiction vs. Nonfiction works checked out in each month.

In [None]:
checkouts_fiction_per_month_df = checkouts2020_df.groupby(['CheckoutMonth', 'Fiction']).size().unstack()
checkouts_fiction_per_month_df.plot(kind='line', xticks=[1,2,3,4,5,6,7,8,9,10,11,12], ylabel='Number of Checkouts')

In [None]:
#We can also isolate certain categories we want.
checkout_types_per_month_df[['BOOK']].plot(kind='line')

# Python dictionaries

Python dictionaries allow us to store data in pairs. This will be useful because we can define dataframes using dictionaries. 

```
example_dict = {
   'key1': value1,
   'key2': value2,
   'key3': value3,
}
```

Note:
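- Keys are usually strings, but any hashable value (such as a number or a tuple) can be a key; values can be of any data type.
- A `,` comes between each key-value pair you define.
- You don't need to lay things out like this, with each key-value pair on its own line, but it does make the dictionary easier to read.
- Items are looked up by **key**, not by position. (Since Python 3.7 a dictionary remembers the order in which items were added, but you still cannot index it by position.)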
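
A short sketch of these points, using a hypothetical `birth_years` dictionary made up for illustration. It also shows that a key does not have to be a string:

```python
# Hypothetical dictionary, used only for illustration
birth_years = {
    'Jane Austen': 1775,
    'Leo Tolstoy': 1828,
    1564: 'William Shakespeare',   # a key does not have to be a string
}

# Lookup is always by key, never by position
print(birth_years['Leo Tolstoy'])
print(birth_years[1564])

# Adding a new pair: assign to a key that does not exist yet
birth_years['Margaret Atwood'] = 1939
print(len(birth_years))
```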

## Accessing items in a dictionary

I can index a list using square brackets to get a specific entry in the list.

However, I can index a dictionary using the name of a key. This gives me the value associated with that key. 

In [None]:
writers_list = ["William Shakespeare", "Jane Austen", "Leo Tolstoy", "Gabriel Garcia Marquez", "Margaret Atwood"]
writers_list[2]

In [None]:
writers_dict = {
    "William Shakespeare": 1564,
    "Jane Austen": 1775,
    "Leo Tolstoy": 1828,
    "Gabriel Garcia Marquez": 1927,
    "Margaret Atwood": 1939,
}

writers_dict["Leo Tolstoy"]

Dictionaries are looked up by key, not by position, so I can't access items using a numerical index (unless that number happens to be a key):

In [None]:
writers_dict[2]
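
If you are not sure whether a key exists, a couple of standard options avoid the error. The sketch below uses a shortened, hypothetical copy of the dictionary (`short_writers_dict`) so that it runs on its own:

```python
short_writers_dict = {"Jane Austen": 1775, "Leo Tolstoy": 1828}

# `in` checks whether something is a key
has_two = 2 in short_writers_dict
has_tolstoy = "Leo Tolstoy" in short_writers_dict
print(has_two, has_tolstoy)

# .get() returns None (or a default you supply) instead of raising KeyError
print(short_writers_dict.get(2))
print(short_writers_dict.get("Homer", "unknown"))

# Looking up a missing key with square brackets raises KeyError
try:
    short_writers_dict[2]
except KeyError as err:
    print("KeyError:", err)
```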

**Question**

Run the first cell to create the variable `example_dict`. Predict the outputs of the following cells, then run them to check.

In [None]:
example_dict = {
    'First Name': 'Walt',
    'Last Name': 'Whitman',
    'Born': 1819,
    'Died': 1892,
    'Major Work': 'Leaves of Grass',
    'Genre': 'Poetry',
    'Movements': ['Transcendentalism', 'Realism', 'Free verse'],
    'Contemporaries': ['Oscar Wilde', 'Henry David Thoreau', 'Bram Stoker']
}

In [None]:
example_dict['Genre']

In [None]:
type(example_dict['Born'])

In [None]:
example_dict[1892]

In [None]:
type(example_dict['Movements'])

In [None]:
example_dict['Contemporaries'][1]

**Question**

What does the `.keys()` method output? It outputs a list-like view object, but what does that object contain?

What does the `.values()` method output? It also outputs a list-like view object, but what does that object contain?

In [None]:
writers_dict.keys()

In [None]:
writers_dict.values()

## Making a pandas DataFrame from a dictionary

I can create a dataframe from a dictionary using the ``DataFrame`` constructor in the pandas library. I pass ``DataFrame`` a dictionary where the **keys** are the names of the columns and the corresponding **values** are **lists** containing all the values in each column. The lists must all have the same length.

In [None]:
import pandas as pd
name_list = ["apple", "banana", "blueberry", "mango", "avocado", "grape"]
name_length_list = [5, 6, 9, 5, 7, 5]
colour_list = ["red", "yellow", "blue", "orange", "green", "purple"]

fruits_df = pd.DataFrame({
    'Name': name_list,
    'Name Length': name_length_list,
    'Colour': colour_list
})
fruits_df

In [None]:
#What happens if we try to break it?
name_list = ["apple", "banana", "blueberry", "mango", "avocado", "grape"]
name_length_list = [5, 6, 9, 5, 7, 5, 5, 7, 5]
colour_list = ["red", "yellow", "blue", "orange", "green", "purple"]

fruits_df = pd.DataFrame({
    'Name': name_list,
    'Name Length': name_length_list,
    'Colour': colour_list
})
fruits_df
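
One way to diagnose this kind of problem is to compare the list lengths before building the dataframe. A minimal sketch with made-up lists (the names `toy_names` and `toy_lengths` are hypothetical); pandas raises a `ValueError` when the lists differ in length:

```python
import pandas as pd

toy_names = ["apple", "banana", "mango"]
toy_lengths = [5, 6, 5, 9]   # one value too many

# Check the lengths first
print(len(toy_names), len(toy_lengths))

try:
    toy_df = pd.DataFrame({'Name': toy_names, 'Name Length': toy_lengths})
except ValueError as err:
    toy_df = None
    print("Could not build the dataframe:", err)
```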

# Sentiment analysis

## Polarity

To do sentiment analysis on a piece of text, I must first convert it into a blob object using ``TextBlob``. Then I can obtain the polarity, subjectivity, etc. of that piece of text.

In [None]:
from textblob import TextBlob

In [None]:
whitman_sentence = "I celebrate myself, and sing myself, \
         And what I assume you shall assume, \
         For every atom belonging to me as good belongs to you."

In [None]:
# Convert into a blob object

whitman_sentence_blob = TextBlob(whitman_sentence)

**Question**

Why does the following give an error?

In [None]:
whitman_sentence.polarity

**Question**

Let's try to understand how polarity is calculated. Why is the polarity of ``whitman_sentence_blob`` 0.7?

In [None]:
print(whitman_sentence_blob)
print(whitman_sentence_blob.polarity)

What is the code below doing? What does it tell us about where the 0.7 value comes from?

In [None]:

for word in whitman_sentence.split(" "):
    print(TextBlob(word).polarity)

Consider the polarity of certain words:

In [None]:
TextBlob("happy").polarity

In [None]:
TextBlob("horrible").polarity

In [None]:
TextBlob("awesome").polarity

**Question**

How does the polarity change when we combine these words?

In [None]:
TextBlob("happy awesome").polarity

In [None]:
TextBlob("happy horrible").polarity
#-0.1

**Question** 

Try to think of a word that will give a polarity of 0.

In [None]:
TextBlob("").polarity

How do 0-polarity words affect the output?

In [None]:
#How do 0-polarity words affect the output?
TextBlob("happy").polarity

What happens with negation? What about other modifiers, like "very"?

In [None]:
TextBlob("not").polarity

In [None]:
TextBlob("not happy").polarity

In [None]:
TextBlob("isn't happy").polarity

In [None]:
TextBlob("very happy").polarity

In [None]:
TextBlob("not very happy").polarity

In [None]:
TextBlob("very not happy").polarity
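
As a rough mental model for the experiments above, you can think of polarity as an average over the words that carry sentiment. The sketch below is not TextBlob's actual implementation; it uses a tiny hypothetical lexicon (`toy_lexicon`, with scores assumed for illustration) and skips words that are not in it. It also ignores negation and intensifiers such as "not" and "very", which need separate handling.

```python
# Hypothetical lexicon; the scores are assumptions chosen for illustration
toy_lexicon = {'happy': 0.8, 'horrible': -1.0, 'awesome': 1.0}

def toy_polarity(text):
    """Average the scores of the words found in toy_lexicon (0.0 if none)."""
    scores = [toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

print(toy_polarity("happy horrible"))
print(toy_polarity("the happy cat"))
```

Comparing this toy function's output with what TextBlob returns for the same phrases is one way to spot where the real library does something more elaborate.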

## Using TextBlob methods

In [None]:
# Using `with` closes the file automatically once it has been read
with open("song_of_myself.txt", encoding="utf-8") as f:
    song_of_myself = f.read()

In [None]:
from textblob import TextBlob
whitman_blob = TextBlob(song_of_myself)

**Question**: 

What do `.words` and `.sentences` output?

In [None]:
whitman_blob.words[:30]

In [None]:
whitman_blob.sentences[:10]

Recall what the `.join()` method does. What does `' '.join(sentence.words)` give? We do this because we want to convert each sentence into a plain string, not the `Sentence` object that TextBlob creates.
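
As a quick refresher, here is `.join()` on a small made-up list (`toy_words` is a hypothetical stand-in for `sentence.words`):

```python
# Hypothetical list of words, standing in for sentence.words
toy_words = ['I', 'celebrate', 'myself']

# The string before .join() is placed between the items
joined = ' '.join(toy_words)
print(joined)
print('-'.join(toy_words))
print(type(joined))
```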

In [None]:
whitman_sentences = []
whitman_polarities = []
whitman_subjectivities = []

whitman_sentences_blob = whitman_blob.sentences

for sentence in whitman_sentences_blob:
    whitman_sentences.append(' '.join(sentence.words))
    whitman_polarities.append(sentence.polarity)
    whitman_subjectivities.append(sentence.subjectivity)

In [None]:
print(whitman_sentences[:7])

In [None]:
whitman_blob.sentiment

Now we can create a dataframe using the lists we just created.

In [None]:
import pandas as pd
whitman_df = pd.DataFrame({
    'sentence': whitman_sentences,
    'polarity': whitman_polarities,
    'subjectivity': whitman_subjectivities
})

whitman_df

**Question:**

How can I change the line 

`whitman_df.sort_values(by='polarity', ascending = True)`

to obtain only the 10 sentences with the highest polarity? 

*Hint: How do you slice a list? The syntax for slicing a dataframe is the same.*


What about the 10 sentences with the lowest polarity? 

*Hint: You need to change the `ascending` parameter.*

What about the 10 sentences with highest *subjectivity*?

*Hint: You need to change the `by` parameter.*
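
To see the mechanics on something small, here is a sketch with a made-up dataframe (`toy_df` and its `score` column are hypothetical). It sorts by one column and slices the first rows, which is the same pattern the hints point to:

```python
import pandas as pd

toy_df = pd.DataFrame({
    'word': ['a', 'b', 'c', 'd'],
    'score': [0.2, -0.5, 0.9, 0.0],
})

# Sort from largest to smallest score, then keep the first two rows
top_two = toy_df.sort_values(by='score', ascending=False)[:2]
print(top_two)
```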

In [None]:
# The following line lets the output display each full sentence instead of truncating it.
pd.set_option('display.max_colwidth', 0)


whitman_df.sort_values(by='polarity', ascending = True)

In [None]:
whitman_df[['polarity']].plot(kind='line', figsize=(10,5))

In [None]:
# How can we use this to calculate a type-token ratio (TTR)?
whitman_blob.word_counts
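
One possible approach, sketched on a short made-up text rather than the poem: the type-token ratio (TTR) is the number of distinct words (types) divided by the total number of words (tokens). The sketch uses `collections.Counter` as a stand-in for the word-to-count mapping that `.word_counts` provides; the names `toy_counts` and `ttr` are hypothetical.

```python
from collections import Counter

toy_text = "the cat sat on the mat and the dog sat too"
toy_counts = Counter(toy_text.split())   # word -> number of occurrences

num_types = len(toy_counts)              # distinct words
num_tokens = sum(toy_counts.values())    # all words, counting repeats
ttr = num_types / num_tokens
print(num_types, num_tokens, round(ttr, 3))
```

The same two quantities, the number of keys and the sum of the values, should be available from any word-to-count dictionary.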