## Intermediate Data Science

#### University of Redlands - DATA 201
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data201.joannabieri.com](https://joannabieri.com/data201_intermediate.html)

In [1]:
# Some basic package imports
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

In [34]:
# Make sure that you load the new packages from lecture if needed
# !conda install -y lxml beautifulsoup4 html5lib
# !conda install -y openpyxl xlrd
# !conda install -y requests

### You Try - 3 Warm-Up Problems From Lecture

Here is a file that does not just read in nicely. See if you can use optional arguments to read it in.

*Hint* How many (and which) rows of this data are just junk?

**Terminal Command Line:**

The command

        cat data/ex4.csv

if typed into a terminal prints out the contents of the file line by line. This lets us take a quick look at what is in the file. BEWARE - if you do this with a large file it will take a long time to print! Another great command is

        head data/ex4.csv

which shows just the first 10 lines of the file!
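
If a terminal is not handy, the same quick look can be done from Python. This is a minimal sketch, not part of the lecture code: `peek` is a hypothetical helper name, and it assumes the file is plain UTF-8 text.

```python
from itertools import islice


def peek(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding='utf-8') as f:
        # islice stops after n lines, so large files are not read in full
        return [line.rstrip('\n') for line in islice(f, n)]


# Example usage (assumes the file exists at this path):
# for line in peek('data/ex4.csv'):
#     print(line)
```

Like `head`, this only touches the first few lines, so it stays fast even when the file is large.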

In [5]:
# This code lets you look at the data
# the terminal command "cat" - prints the contents of a file
# when we do !cat filename we can look at the file's contents
file_name = 'data/ex4.csv'
!cat data/ex4.csv

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [25]:
# The first, third, and fourth lines are junk
# skiprows based on index
file_name = 'data/ex4.csv'
pd.read_csv(file_name, skiprows=[0,2,3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
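
Another option, sketched here under the assumption that every junk line starts with `#` (as it does in the file printed above), is the `comment` argument of `pd.read_csv`. The file contents are pasted into a string so the example runs without `data/ex4.csv` on disk.

```python
import io

import pandas as pd

# Same text as data/ex4.csv, copied from the cat output above
raw = """# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
"""

# Lines that begin with '#' are ignored, so no row counting is needed
df_comment = pd.read_csv(io.StringIO(raw), comment='#')

# The skiprows approach from the cell above, for comparison
df_skip = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])

print(df_comment)
```

One caution: `comment='#'` also cuts off the rest of any data line that contains a `#`, so `skiprows` is the safer choice when that character can appear in the data.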


------------------------------------------------------

In [27]:
# EXAMPLES - DICTIONARY TO PANDAS
my_dict = {"name": "Wes",
 "cities_lived": ["Akron", "Nashville", "New York", "San Francisco"],
 "pet": None,
 "siblings": [{"name": "Scott", "age": 34, "hobbies": ["guitars", "soccer"]},
              {"name": "Katie", "age": 42, "hobbies": ["diving", "art"]}]
}
my_dict

{'name': 'Wes',
 'cities_lived': ['Akron', 'Nashville', 'New York', 'San Francisco'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 34, 'hobbies': ['guitars', 'soccer']},
  {'name': 'Katie', 'age': 42, 'hobbies': ['diving', 'art']}]}

In [29]:
for key in my_dict.keys():
    print(key)
    print(my_dict[key])
    print('----------')

name
Wes
----------
cities_lived
['Akron', 'Nashville', 'New York', 'San Francisco']
----------
pet
None
----------
siblings
[{'name': 'Scott', 'age': 34, 'hobbies': ['guitars', 'soccer']}, {'name': 'Katie', 'age': 42, 'hobbies': ['diving', 'art']}]
----------


In [30]:
# This code will give you an error
df = pd.DataFrame(my_dict)
df

ValueError: All arrays must be of the same length

In [31]:
# This code will work
df = pd.DataFrame(my_dict['siblings'])
df

Unnamed: 0,name,age,hobbies
0,Scott,34,"[guitars, soccer]"
1,Katie,42,"[diving, art]"


### You Try:

Can you explain what is going on in the examples above? Why does one give an error and the other works? What specifically is it about focusing in on the siblings data that allows pandas to read this?

**Your explanation here:**
`df = pd.DataFrame(my_dict)` gives an error because the values have different lengths: `cities_lived` is a list of 4 items, `siblings` is a list of 2 items, and `name` and `pet` are single values, so pandas cannot line them up as equal-length columns. `df = pd.DataFrame(my_dict['siblings'])` works because `siblings` is a list of dictionaries that all share the same keys. Pandas reads each dictionary as one row, with the keys as columns and the values as the data in the dataframe.
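
A quick way to check this explanation is to look at the lengths directly. The sketch below reuses `my_dict` from the example above; the variable names `lengths` and `error_message` are only for illustration.

```python
import pandas as pd

my_dict = {"name": "Wes",
           "cities_lived": ["Akron", "Nashville", "New York", "San Francisco"],
           "pet": None,
           "siblings": [{"name": "Scott", "age": 34, "hobbies": ["guitars", "soccer"]},
                        {"name": "Katie", "age": 42, "hobbies": ["diving", "art"]}]}

# Length of each list-valued entry; "name" and "pet" are single values
lengths = {key: len(value) for key, value in my_dict.items()
           if isinstance(value, list)}
print(lengths)

# The full dictionary fails because the lists cannot form equal-length columns
try:
    pd.DataFrame(my_dict)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# The siblings list works: each dictionary becomes one row
siblings = pd.DataFrame(my_dict['siblings'])
print(siblings)
```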

---------------------------------------

### You Try

Here is an example website that contains a table:

https://www.scrapethissite.com/pages/forms/

1. Open the website in your browser. Does the page that appears contain ALL the data about hockey teams?
2. How does the web address change when you select the second page of the website?
3. See if you can write code that will scrape all of the data. HINT: I would use a for loop that updates the web address and appends the new table to a list.
4. Once you have the list of tables can you get them into a single data frame and save the data as a .csv?

In [38]:
# Here is how I could get one page
website = 'https://www.scrapethissite.com/pages/forms/'
tables = pd.read_html(website)
len(tables)

1

In [53]:
# Your code here
# create list of page numbers
# page numbers added to url as seen below
# loop for all page numbers
# append each table to list
table_list = []
pages = list(range(1, 25))
for n in pages:
    website = 'https://www.scrapethissite.com/pages/forms/?page_num=' + str(n)
    tables = pd.read_html(website)
    table_list.append(tables[0])
df = pd.concat(table_list)
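# add the .csv extension; the index repeats on every page, so leave it out of the file
df.to_csv('hockey_teams.csv', index=False)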
df

Unnamed: 0,Team Name,Year,Wins,Losses,OT Losses,Win %,Goals For (GF),Goals Against (GA),+ / -
0,Boston Bruins,1990,44,24,,0.550,299,264,35
1,Buffalo Sabres,1990,31,30,,0.388,292,278,14
2,Calgary Flames,1990,46,26,,0.575,344,263,81
3,Chicago Blackhawks,1990,49,23,,0.613,284,211,73
4,Detroit Red Wings,1990,34,38,,0.425,273,298,-25
...,...,...,...,...,...,...,...,...,...
2,Tampa Bay Lightning,2011,38,36,8.0,0.463,235,281,-46
3,Toronto Maple Leafs,2011,35,37,10.0,0.427,231,264,-33
4,Vancouver Canucks,2011,51,22,9.0,0.622,249,198,51
5,Washington Capitals,2011,42,32,8.0,0.512,222,230,-8
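
The loop above hard-codes 24 pages. A possible variant, sketched here under the assumption that a page past the end comes back with an empty table or no table at all, keeps reading until the data runs out. `scrape_all_pages`, `read_page`, and `max_pages` are hypothetical names; `read_page` defaults to `pd.read_html` and exists only so the function can be tried with a stand-in reader.

```python
import pandas as pd

BASE_URL = 'https://www.scrapethissite.com/pages/forms/?page_num='


def scrape_all_pages(read_page=pd.read_html, max_pages=100):
    """Collect the first table from each page until a page has no rows."""
    frames = []
    for n in range(1, max_pages + 1):
        try:
            tables = read_page(BASE_URL + str(n))
        except ValueError:
            # pd.read_html raises ValueError when a page contains no tables
            break
        if not tables or tables[0].empty:
            break
        frames.append(tables[0])
    if not frames:
        return pd.DataFrame()
    # ignore_index gives one continuous index instead of restarting on each page
    return pd.concat(frames, ignore_index=True)


# Example usage (needs an internet connection):
# df = scrape_all_pages()
# df.to_csv('hockey_teams.csv', index=False)
```

The `max_pages` cap is a safety net so the loop cannot run forever if the stopping assumption turns out to be wrong for this site.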


------------------------------------------------

## Reading and Writing Data - Day3 HW

## Homework 3

Go to Kaggle Datasets: https://www.kaggle.com/datasets

Find a data set that you are interested in looking at. You are welcome to work together and choose a data set as a group! You should read in this data and do some basic statistics on the data set. Answer the following questions:

1. Tell your reader about the data: Where did you get it? When did you access it? Who owns it? What is the license? Are there any acknowledgments that you should give for using the data? All of this should be on the Kaggle page.
2. How many variables and observations?
3. What type of data is contained? Was it read in as strings, ints, floats?
4. Are there any NaNs or weird data types that you can see?
5. Most Kaggle datasets contain some basic stats or visualizations on the download page. See if you can recreate some of the plots or data you see on the website.
6. Come up with at least one question of your own that you can answer by analyzing the data.
7. Create a dataframe with just the data you need to answer your question - save the data subset to a file (your choice of type)
8. **In a NEW NOTEBOOK** Write code that reads in your subset of the data, markdown that explains clearly where you got the data originally (license and references included) and the process you took to create your subset, a description of the question you are answering, and code that can reliably run and answer your question followed by words that explain your results.
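
As a starting point for questions 2 to 4 and question 7, here is a minimal sketch of the inspection and subset steps. The DataFrame is made up for illustration and stands in for whatever Kaggle data you choose; the column names, the filter, and the `subset.csv` file name are all assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up stand-in for a Kaggle dataset
df = pd.DataFrame({
    'city': ['Redlands', 'Akron', 'Nashville', 'Redlands'],
    'year': [2020, 2021, 2021, 2022],
    'score': [3.5, np.nan, 4.0, 4.5],
})

# Question 2: rows are observations, columns are variables
n_obs, n_vars = df.shape
print(n_obs, n_vars)

# Question 3: how each column was read in
print(df.dtypes)

# Question 4: number of NaNs in each column
nan_counts = df.isna().sum()
print(nan_counts)

# Question 7: keep only the rows and columns needed, then save them
subset = df.loc[df['city'] == 'Redlands', ['year', 'score']]
out_path = os.path.join(tempfile.mkdtemp(), 'subset.csv')
subset.to_csv(out_path, index=False)

# Reading the file back is what the NEW notebook would start with
reloaded = pd.read_csv(out_path)
print(reloaded)
```

The sketch writes to a temporary folder so it can run anywhere; in your own notebook you would save next to the notebook so the file can be handed in.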

------------------------------------

Your final notebooks should:

- [ ] Be completely new notebooks with just the Day3 work in them: first the code that creates your data, and second the code that reads in the data and does the analysis.
- [ ] Be reproducible with junk code removed.
- [ ] Have lots of language describing what you are doing, especially for questions you are asking or things that you find interesting about the data. Use complete sentences, nice headings, and good markdown formatting: https://www.markdownguide.org/cheat-sheet/
- [ ] Run without errors from start to finish.

DON'T FORGET TO HAND IN YOUR DATA!!!!