## Intermediate Data Science

#### University of Redlands - DATA 201
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data201.joannabieri.com](https://joannabieri.com/data201_intermediate.html)

In [41]:
# Some basic package imports
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

### You Try - 3 Warm-Up Problems From Lecture

Run the data cell to see the data then answer the questions.

Each problem is separated by a line.

------------------------------------------------------

In [42]:
#DATA
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
example_series = pd.Series(sdata)
example_series

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

### You Try:

Using series methods and python code:

1. Get a list of the states that are the index.
2. Check if Idaho is in the index.
3. Get just the states with numbers less than 20,000.
4. Run the provided code and explain the results. Why do we end up with NaN and what does NaN mean?

In [43]:
list(example_series.index)

['Ohio', 'Texas', 'Oregon', 'Utah']

In [44]:
'Idaho' in example_series.index

False

In [45]:
example_series[example_series < 20000]

Oregon    16000
Utah       5000
dtype: int64

In [46]:
# Explain the results 4
states = ["California", "Ohio", "Oregon", "Texas"]
new_series = pd.Series(sdata, index=states)
new_series

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

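`NaN` stands for "Not a Number" and is the placeholder pandas uses for missing data. Passing `index=states` tells pandas to keep only those labels: California is not a key in `sdata`, so it has no value and gets `NaN`, while Utah is not in `states`, so it is dropped. The values also change from `int64` to `float64` because `NaN` is a floating-point value.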
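That result can be checked directly. The sketch below rebuilds the series from the cell above, uses `isna()` to locate the missing entry, and lists the keys that were left out. The names `missing_mask` and `dropped` are illustrative and not part of the original notebook.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]

# Building a Series from a dict with an explicit index keeps only the
# requested labels, in the requested order.
new_series = pd.Series(sdata, index=states)

# California is not a key in sdata, so its entry is NaN.
missing_mask = new_series.isna()
print(missing_mask)

# Utah is a key in sdata but not in states, so it is left out.
dropped = [s for s in sdata if s not in new_series.index]
print(dropped)

# NaN is a float, so the integer values are upcast to float64.
print(new_series.dtype)
```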

---------------------------------------

In [47]:
# DATA
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data)
df

|   | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |


### You Try

1. How does the following command work?
2. See if you can add a column that checks whether the year is greater than 2001.

In [48]:
# How does this work 1
df['eastern'] = df['state'] == 'Ohio'
df

|   | state | year | pop | eastern |
|---|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 | True |
| 1 | Ohio | 2001 | 1.7 | True |
| 2 | Ohio | 2002 | 3.6 | True |
| 3 | Nevada | 2001 | 2.4 | False |
| 4 | Nevada | 2002 | 2.9 | False |
| 5 | Nevada | 2003 | 3.2 | False |


The expression `df['state'] == 'Ohio'` compares each element of the `state` column with the string `'Ohio'` and returns a boolean Series: `True` where the state is Ohio and `False` otherwise. Assigning that Series to `df['eastern']` creates a new column named `eastern` that holds those boolean values.
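To see the two steps separately, the sketch below stores the comparison result in a variable first and only then assigns it as a column. The three-row table and the name `is_ohio` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"state": ["Ohio", "Ohio", "Nevada"],
                   "year": [2000, 2001, 2001]})

# Step 1: the comparison is applied element by element and returns
# a boolean Series with the same index as df.
is_ohio = df["state"] == "Ohio"
print(is_ohio.tolist())  # [True, True, False]

# Step 2: assigning the Series to a new label creates the column.
df["eastern"] = is_ohio
print(df)
```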

In [49]:
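# True where the year is greater than 2001
df['year_check'] = df['year'] > 2001
df

|   | state | year | pop | eastern | year_check |
|---|---|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 | True | False |
| 1 | Ohio | 2001 | 1.7 | True | False |
| 2 | Ohio | 2002 | 3.6 | True | True |
| 3 | Nevada | 2001 | 2.4 | False | False |
| 4 | Nevada | 2002 | 2.9 | False | True |
| 5 | Nevada | 2003 | 3.2 | False | True |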


------------------------------------------------

In [50]:
# DATA
df = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
df

|   | Ohio | Texas | California |
|---|---|---|---|
| a | 0 | 1 | 2 |
| c | 3 | 4 | 5 |
| d | 6 | 7 | 8 |


### You Try
1. Get rows a and d for just Ohio and Texas.

In [51]:
# Select rows "a" and "d" and columns "Ohio" and "Texas" by label
df.loc[["a", "d"], ["Ohio", "Texas"]]

|   | Ohio | Texas |
|---|---|---|
| a | 0 | 1 |
| d | 6 | 7 |


----------------------------------------

## Market Data from Lecture

In [52]:
# Let's read in some data and look at some statistics
price = pd.read_pickle("data/yahoo_price.pkl")
volume = pd.read_pickle("data/yahoo_volume.pkl")

In [53]:
returns = price.pct_change()
returns.head()

| Date | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| 2010-01-04 | NaN | NaN | NaN | NaN |
| 2010-01-05 | 0.001729 | -0.004404 | -0.01208 | 0.000323 |
| 2010-01-06 | -0.015906 | -0.025209 | -0.006496 | -0.006137 |
| 2010-01-07 | -0.001849 | -0.02328 | -0.003462 | -0.0104 |
| 2010-01-08 | 0.006648 | 0.013331 | 0.010035 | 0.006897 |
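The first row of the returns has no values because `pct_change` compares each row with the one before it, and the first date has no predecessor. A minimal sketch on made-up prices (the numbers are not from the Yahoo data, and `toy_price` is an illustrative name) shows the same calculation done by hand with `shift`:

```python
import pandas as pd

# Invented prices for illustration only.
toy_price = pd.Series([100.0, 110.0, 99.0], name="TOY")

# Built-in percent change from one row to the next.
auto = toy_price.pct_change()

# The same quantity by hand: (today - yesterday) / yesterday.
previous = toy_price.shift(1)
manual = (toy_price - previous) / previous

print(auto)
print(manual)
```

Both series start with `NaN`, followed by a rise of about 10% and then a fall of about 10%.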


In [54]:
returns.describe()

|   | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| count | 1713.0 | 1713.0 | 1713.0 | 1713.0 |
| mean | 0.000972 | 0.000671 | 0.000236 | 0.000595 |
| std | 0.016641 | 0.01583 | 0.012102 | 0.014667 |
| min | -0.123558 | -0.083775 | -0.08279 | -0.113995 |
| 25% | -0.007516 | -0.006904 | -0.006049 | -0.007376 |
| 50% | 0.000886 | 0.00027 | 0.000234 | 0.000312 |
| 75% | 0.010422 | 0.008462 | 0.006806 | 0.008162 |
| max | 0.088741 | 0.160524 | 0.056652 | 0.104522 |


In [55]:
returns['AAPL'].corr(returns['IBM'])

np.float64(0.3868174361139099)

In [56]:
returns.corr()

|   | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| AAPL | 1.0 | 0.407919 | 0.386817 | 0.389695 |
| GOOG | 0.407919 | 1.0 | 0.405099 | 0.465919 |
| IBM | 0.386817 | 0.405099 | 1.0 | 0.499764 |
| MSFT | 0.389695 | 0.465919 | 0.499764 | 1.0 |


In [57]:
returns.cov()

|   | AAPL | GOOG | IBM | MSFT |
|---|---|---|---|---|
| AAPL | 0.000277 | 0.000107 | 7.8e-05 | 9.5e-05 |
| GOOG | 0.000107 | 0.000251 | 7.8e-05 | 0.000108 |
| IBM | 7.8e-05 | 7.8e-05 | 0.000146 | 8.9e-05 |
| MSFT | 9.5e-05 | 0.000108 | 8.9e-05 | 0.000215 |
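The correlation and covariance outputs are related: a Pearson correlation is the covariance divided by the product of the two standard deviations. With the rounded values printed above, AAPL and IBM give 7.8e-05 / (0.016641 * 0.012102), which is about 0.39 and agrees with the correlation output up to rounding. The sketch below checks the same identity on randomly generated numbers (not the market data); the names `toy_returns`, `by_hand`, and `built_in` are illustrative.

```python
import numpy as np
import pandas as pd

# Random stand-in for two return columns (not real market data).
rng = np.random.default_rng(0)
toy_returns = pd.DataFrame(rng.normal(0, 0.01, size=(250, 2)),
                           columns=["A", "B"])

# Covariance of A and B, and the standard deviation of each column.
cov_ab = toy_returns.cov().loc["A", "B"]
std_a = toy_returns["A"].std()
std_b = toy_returns["B"].std()

# Correlation by hand versus the built-in method.
by_hand = cov_ab / (std_a * std_b)
built_in = toy_returns["A"].corr(toy_returns["B"])
print(by_hand, built_in)
```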


--------------------------

## Pandas Analysis - Day2 HW

In [58]:
#!conda install kagglehub

In [61]:
!pip install kagglehub

Collecting kagglehub
  Downloading kagglehub-0.3.13-py3-none-any.whl.metadata (38 kB)
Downloading kagglehub-0.3.13-py3-none-any.whl (68 kB)
Installing collected packages: kagglehub
Successfully installed kagglehub-0.3.13


In [64]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("yasserh/titanic-dataset")

print("Path to dataset files:", path)

Path to dataset files: /Users/admin/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1


In [67]:
# Copy the path to the data set
# For me this was:
file = '/Users/admin/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1/Titanic-Dataset.csv'
# Yours will be different!

df = pd.read_csv(file)
df

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


**Your goal is to do a quick analysis of the Titanic data! You can answer any questions that you find interesting but here are some things to start with:**

1. How many variables and observations? Which are Numerical/Categorical?
2. Do any of the columns have NaNs in them? What do NaNs mean?
3. How many passengers survived?
4. Is survival correlated with Fare?
5. How many passengers were alone vs. traveling with family?
6. Were people traveling alone more or less likely to survive?
7. Do the basic statistics change if you group by class?

And so on. See if you can come up with some questions of your own! Curiosity is a big part of data science!

How far can you get in just an hour or two?
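As a starting point for questions 1 to 4, here is a minimal sketch on a tiny hand-made table that borrows a few Titanic column names. The five rows are invented for illustration, so the numbers mean nothing; the same calls applied to the real `df` give the actual answers. The names `toy`, `nan_counts`, `n_survived`, and `fare_corr` are assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic a few Titanic columns (not real passengers).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 3, 2, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, 19.0],
    "Fare": [7.25, 71.28, 7.92, 13.00, 30.00],
})

# 1. Observations (rows), variables (columns), and the dtype of each column.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
print(toy.dtypes)

# 2. Number of missing values in each column.
nan_counts = toy.isna().sum()
print(nan_counts)

# 3. Survived is coded 0/1, so the sum is the number of survivors.
n_survived = toy["Survived"].sum()
print(n_survived)

# 4. Pearson correlation between survival and fare.
fare_corr = toy["Survived"].corr(toy["Fare"])
print(fare_corr)
```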


---------------------------------

**Variable Notes**
- PassengerId:   Unique ID of the passenger
- Survived:   Survived (1) or died (0)
- Pclass:   Passenger’s class (1st, 2nd, or 3rd)
- Name:   Passenger’s name
- Sex:   Passenger’s sex
- Age:   Passenger’s age
- SibSp:   Number of siblings/spouses aboard the Titanic
- Parch:   Number of parents/children aboard the Titanic
- Ticket:   Ticket number
- Fare:   Fare paid for ticket
- Cabin:   Cabin number
- Embarked:   Where the passenger got on the ship (C = Cherbourg, S = Southampton, Q = Queenstown)
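The `SibSp` and `Parch` notes suggest one way to approach questions 5 to 7: a passenger with zero in both columns had no family aboard. The sketch below uses invented rows and a hypothetical helper column named `alone`; it is one possible approach, not the required one.

```python
import pandas as pd

# Invented rows that mimic a few Titanic columns (not real passengers).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 2, 1, 3],
    "SibSp": [1, 1, 0, 0, 0, 0],
    "Parch": [0, 0, 0, 0, 2, 0],
    "Fare": [7.25, 71.28, 7.92, 13.00, 30.00, 8.05],
})

# 5. A passenger is alone when both family counts are zero.
toy["alone"] = (toy["SibSp"] + toy["Parch"]) == 0
alone_counts = toy["alone"].value_counts()
print(alone_counts)

# 6. The mean of a 0/1 column is the survival rate within each group.
survival_by_alone = toy.groupby("alone")["Survived"].mean()
print(survival_by_alone)

# 7. Basic statistics of Fare computed separately for each class.
fare_by_class = toy.groupby("Pclass")["Fare"].describe()
print(fare_by_class)
```

Grouping by a boolean column such as `alone` works the same way as grouping by `Pclass`, so the pattern in step 7 can be reused for any other categorical column.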

------------------------------------

Your final notebook should:

- [ ] Be a completely new notebook with just the Titanic stuff in it: HW2-Titanic.ipynb
- [ ] Be reproducible with junk code removed.
- [ ] Have lots of language describing what you are doing, especially for questions you are asking or things that you find interesting about the data. Use complete sentences, nice headings, and good markdown formatting: https://www.markdownguide.org/cheat-sheet/
- [ ] Run without errors from start to finish.