# This is a Jupyter Notebook

A Jupyter Notebook is a data-science environment that combines:

1. **Narrative:** The text describing your analysis
2. **Code:** The program that does the analysis
3. **Results:** The output of the program

The Jupyter environment was created by faculty at UC Berkeley (Fernando Pérez). These ideas now appear in many other tools (e.g., Google Colab, Deepnote, and GitHub Codespaces).

<br/><br/>

The link from Canvas sets up the notebook in a cloud environment, which allows you to work on any device.


<br/><br/>

---

## Example

**Note:** In this lecture there is a lot of code. You are not expected to know any of this yet. This is just a preview of the things you will see in the next few weeks. 

<br/><br/><br/>



### Getting the Data

We can use the tools of data science to study text.  For example, here we will do some basic analysis of *["Adventures of Huckleberry Finn"](https://en.wikipedia.org/wiki/Adventures_of_Huckleberry_Finn)* (by Mark Twain) and *["Little Women"](https://en.wikipedia.org/wiki/Little_Women)* (by Louisa May Alcott).  

Often the first step in data science is getting the data.  The following is a tiny program to download text from the web.

In [None]:
# A tiny program to download text from the web.
def read_url(url): 
    from urllib.request import urlopen 
    import re
    return re.sub('\\s+', ' ', urlopen(url).read().decode())

We ran the code cell above by pressing SHIFT+ENTER.
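
The `read_url` function above does two things: it downloads the text, and it collapses every run of whitespace into a single space. The second step can be tried without a network connection. The following is a minimal sketch; the helper name `normalize_whitespace` and the sample string are illustrative and not part of the lecture code.

```python
import re

def normalize_whitespace(text):
    """Collapse each run of spaces, tabs, and newlines into one space."""
    # Same pattern as in read_url, written as a raw string.
    return re.sub(r'\s+', ' ', text)

# Illustrative sample containing newlines, a tab, and repeated spaces.
sample = "CHAPTER I.\n\nYOU don't know about me\twithout you   have read a book"
cleaned = normalize_whitespace(sample)
print(cleaned)
```

After this step the whole book is one long line of text, which makes it easier to split and count later.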

<br/><br/>

Here we download the books from the [*Computational and Inferential Thinking* textbook website](https://inferentialthinking.com/chapters/intro.html).  This is the textbook you will be using for this class.

In [None]:
huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

In [None]:
little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]

<br/><br/>
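
The cells above break each book into chapters by splitting the text on the marker `'CHAPTER '` and then dropping the leading pieces. The sketch below shows the idea on a made-up miniature "book"; `toy_text` is invented for illustration.

```python
# A made-up miniature "book": front matter followed by three chapters.
toy_text = (
    "Title page and preface. "
    "CHAPTER I. First text. "
    "CHAPTER II. Second text. "
    "CHAPTER III. Third text."
)

# Splitting on the marker yields one piece per occurrence, plus
# whatever came before the first occurrence.
pieces = toy_text.split('CHAPTER ')

# pieces[0] is the front matter, so the chapters start at index 1.
chapters = pieces[1:]

print(len(pieces))    # number of pieces, front matter included
print(chapters[0])    # text of the first chapter
```

The slice `[1:]` skips only the text before the first marker. A larger offset, such as the `[44:]` used for Huckleberry Finn, presumably skips earlier occurrences of the marker as well (for example in a table of contents); that reading is an assumption and is not checked here.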

Let's look at the text from the first chapter of Huckleberry Finn:

In [None]:
# write some code here


<br/><br/>
# Working with Tables

A lot of data science is about transforming data, often to produce tables that we can more easily analyze.
In this class you will use the `datascience` library to manipulate data.  This Python package was developed at UC Berkeley to support data science education.

In [None]:
import datascience
datascience.__version__

In [None]:
from datascience import *

Take the text from the different chapters above and put it in a Table. 

In [None]:
Table().with_column('Chapters', huck_finn_chapters)

<br/><br/>

### Summarizing Data

We will explore data by extracting summaries. For example, we might ask how often each character appears in each chapter. We can use snippets of code to answer these questions.

In [None]:
import numpy as np

_**Question**_  How often is Tom mentioned in each chapter? 

In [None]:
np.char.count(huck_finn_chapters, 'Tom')

In [None]:
np.char.count(huck_finn_chapters, 'Jim')

_**Question**_ *What can we say about these characters just from the numbers?* 
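
When reading these numbers, it helps to know exactly what is being counted. `np.char.count` counts occurrences of a substring in each element, much like Python's built-in `str.count`. The pure-Python sketch below mimics that behavior on made-up text; `toy_chapters` and `count_in_each` are hypothetical names used only for illustration.

```python
# Made-up "chapters" for illustration only.
toy_chapters = [
    "Tom and Jim went fishing. Tom laughed.",
    "Jim waited until tomorrow.",
    "Nobody was home.",
]

def count_in_each(texts, name):
    """Count occurrences of the substring `name` in each text."""
    return [text.count(name) for text in texts]

tom_counts = count_in_each(toy_chapters, 'Tom')
jim_counts = count_in_each(toy_chapters, 'Jim')
print(tom_counts)
print(jim_counts)

# Matching is case-sensitive and works on substrings, not whole words:
# 'Jo' is also found inside 'John'.
jo_count = "John and Jo".count('Jo')
print(jo_count)
```

Two consequences follow: "tomorrow" is not counted as "Tom" because of the lowercase "t", but a short name can be over-counted when it is the start of a longer word.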




<br/><br/><br/>


We can convert the results of the analysis above (more data) into tables.

In [None]:
counts = Table().with_columns([
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
    'Huck', np.char.count(huck_finn_chapters, 'Huck'),
])
counts

_**Question**_ The book is called the *Adventures of Huckleberry Finn*, but Huck does not seem to be mentioned often.  Why? 


<br/><br/><br/><br/>


# We will Learn to Visualize Data

Plot the cumulative counts:
how many times each name appears in Chapter 1, how many times in Chapters 1 and 2 combined, and so on.
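
A cumulative count is a running total. As a small sketch with made-up per-chapter counts, the standard library's `itertools.accumulate` produces the same kind of running total that NumPy's `cumsum` computes for arrays.

```python
from itertools import accumulate

# Made-up number of mentions in chapters 1 through 4.
per_chapter = [2, 0, 3, 1]

# Running total: entry k is the sum of the first k + 1 chapter counts.
cumulative = list(accumulate(per_chapter))
print(cumulative)  # [2, 2, 5, 6]
```

A flat stretch in the running total means the name did not appear in those chapters, and a steep stretch means it appeared often.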


In [None]:
# %matplotlib inline
# import matplotlib.pyplot as plt
# plt.style.use('fivethirtyeight')
# cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 44, 1))
# cum_counts.plot(column_for_xticks="Chapter")
# plt.title('Cumulative Number of Times Name Appears');

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
cum_counts = Table().with_columns([
    'Tom', counts['Tom'].cumsum(), 
    'Jim', counts['Jim'].cumsum(), 
    'Huck', counts['Huck'].cumsum(), 
    'Chapter', np.arange(1, 44, 1),
])
cum_counts.plot(column_for_xticks = "Chapter")
plt.title('Cumulative Number of Times Name Appears');

_**Question**_ What can we tell from this visualization?  
What questions does this raise?



<br/><br/><br/>
Let's examine this idea with another book, *Little Women*.
<br/><br/><br/>

In [None]:
# The chapters of Little Women
Table().with_column('Chapters', little_women_chapters)

<br/><br/><br/><br/><br/>

We can explore the characters in Little Women using the same kind of analysis.

In [None]:
# Counts of names in the chapters of Little Women
names = ['Amy', 'Beth', 'Jo', 'Laurie', 'Meg']
mentions = {name: np.char.count(little_women_chapters, name) for name in names}
counts = Table().with_columns([
        'Amy', mentions['Amy'],
        'Beth', mentions['Beth'],
        'Jo', mentions['Jo'],
        'Laurie', mentions['Laurie'],
        'Meg', mentions['Meg']
    ])

In [None]:
# Look at the counts 
counts

In [None]:
# Plot the cumulative counts
Table.static_plots()
#cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 48, 1))
cum_counts = Table().with_columns([
    'Amy', counts['Amy'].cumsum(), 
    'Beth', counts['Beth'].cumsum(), 
    'Jo', counts['Jo'].cumsum(), 
    'Laurie', counts['Laurie'].cumsum(),
    'Meg', counts['Meg'].cumsum(), 
    'Chapter', np.arange(1, 48, 1),
])
cum_counts.plot(column_for_xticks=5)
plt.title('Cumulative Number of Times Name Appears');

The book is about four sisters: Amy, Beth, Jo, and Meg.  Laurie is their neighbor.  

_**Question**_ What can we say about the book and its plot from just this graph? **SPOILERS**




<br/><br/><br/>

We can use interactive tools.

In [None]:
# Plot the cumulative counts
Table.interactive_plots()
cum_counts.plot(column_for_xticks=5)

<br/><br/><br/><br/><br/> 

---

# Examining Length

How long are the books? How long are sentences?


In [None]:
len(read_url(huck_finn_url))

In [None]:
# In each chapter, count the number of all characters;
# call this the "length" of the chapter.
# Also count the number of periods.

length_hf = Table().with_columns([
        'Length', [len(s) for s in huck_finn_chapters],
        'Periods', np.char.count(huck_finn_chapters, '.')
    ])
length_lw = Table().with_columns([
        'Length', [len(s) for s in little_women_chapters],
        'Periods', np.char.count(little_women_chapters, '.')
    ])
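
To see what the two columns measure, here is a small sketch on made-up text. `toy_chapters` is invented for illustration, and plain lists stand in for the table columns.

```python
# Two made-up "chapters".
toy_chapters = ["It was late. We slept.", "Rain fell all day."]

# "Length" counts every character, including spaces and punctuation.
lengths = [len(s) for s in toy_chapters]

# "Periods" counts every '.' character, a rough proxy for sentences.
periods = [s.count('.') for s in toy_chapters]

print(lengths)  # [22, 18]
print(periods)  # [2, 1]
```

The period count is only an approximation of the number of sentences: abbreviations such as "Mr." add periods, while sentences ending in "?" or "!" add none.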

In [None]:
# The counts for Huckleberry Finn
length_hf

In [None]:
# The counts for Little Women
length_lw

In [None]:
Table.static_plots()
plt.figure(figsize=(8,8))
plt.scatter(length_hf[1], length_hf[0], color='darkblue', label='Adv.HuckFinn')
plt.scatter(length_lw[1], length_lw[0], color='gold', label='LittleWomen')
plt.xlabel('Number of periods in chapter')
plt.ylabel('Number of characters in chapter');
plt.legend();


<br/><br/><br/><br/><br/>

---

## Examining distributions

In [None]:
Table.static_plots()
length_hf.with_columns("Sentence Length", length_hf['Length']/length_hf['Periods']).hist("Sentence Length")
plt.title('Huckleberry Finn');

In [None]:
Table.static_plots()
length_lw.with_columns("Sentence Length", length_lw['Length']/length_lw['Periods']).hist("Sentence Length")
plt.title('Little Women');
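
The "Sentence Length" column in the histograms above is each chapter's character count divided by its period count. The sketch below works through that arithmetic on made-up numbers; `chars_per_period` is a hypothetical helper, and returning `None` for a chapter without periods is a choice made here, not something the lecture code does.

```python
def chars_per_period(length, periods):
    """Approximate sentence length: characters per period."""
    if periods == 0:
        # No periods means the ratio is undefined.
        return None
    return length / periods

# Made-up (length, periods) pairs for three chapters.
toy = [(22, 2), (18, 1), (40, 0)]
ratios = [chars_per_period(length, periods) for length, periods in toy]
print(ratios)  # [11.0, 18.0, None]
```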