## Intermediate Data Science

#### University of Redlands - DATA 201
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data201.joannabieri.com](https://joannabieri.com/data201_intermediate.html)

In [1]:
# Some basic package imports
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

### You Try - 4 Warm-Up Problems From Lecture

## You Try

What does frame.unstack() do in this case? Go ahead and run the command and see if you can understand the results.

In [2]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"],
                              ["Green", "Red", "Green"]])
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

frame

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key1,key2,,,
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


frame.unstack() takes the inner row index (key2) and moves it into the columns. The rows are then indexed by key1 alone (a and b), and the columns gain a third level, so each column is labeled by state, color, and key2.
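The cell above only displays frame, so here is a minimal sketch that actually runs the unstack and inspects the result. It rebuilds the same frame so it can run on its own; the variable name unstacked is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the cell above.
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

# By default unstack() moves the innermost row level (key2) into the columns.
unstacked = frame.unstack()

print(list(unstacked.index.names))    # row levels that remain
print(list(unstacked.columns.names))  # column levels after unstacking
print(unstacked.shape)                # 2 rows, 3 original columns x 2 key2 values
print(unstacked)
```

Each original column is split into one column per key2 value, so the value that sat in row (a, 2) under Ohio/Green now sits in row a under Ohio/Green/2.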

-------------------------------------
## You Try

How would you swap the index keys? See if you can swap key1 and key2 in the new_frame.


In [4]:
new_frame = frame.swaplevel('state','color', axis=1)
new_frame

,color,Green,Red,Green
,state,Ohio,Ohio,Colorado
key1,key2,,,
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [4]:
# Your code here
new_frame = frame.swaplevel('key1', 'key2')
new_frame

,state,Ohio,Ohio,Colorado
,color,Green,Red,Green
key2,key1,,,
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11
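Notice that swaplevel only relabels the levels; the rows stay in their original order (1, 2, 1, 2 in key2). As a small sketch of one possible follow-up, sorting the index afterwards groups the rows by the new outer level. The names swapped and swapped_sorted are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the earlier cell.
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

# Swap the two row levels; the data and row order are unchanged.
swapped = frame.swaplevel("key1", "key2")

# Sort so that rows are grouped by key2 first, then key1.
swapped_sorted = swapped.sort_index()

print(swapped_sorted)
```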


-------------------------------------------------------
## You Try

Merge the following data sets using all four ways: inner, left, right, and outer. See if you can predict before running the code what the output will be!

In [6]:
df_animals = pd.DataFrame({
    'animal_id': [1, 2, 3, 4],
    'name': ['Leo', 'Stripes', 'Spot', 'Fluffy'],
    'type': ['Lion', 'Tiger', 'Cheetah', 'Cat']
})

df_habitats = pd.DataFrame({
    'animal_id': [1, 2, 5, 4],
    'habitat': ['Savannah', 'Jungle', 'Mountains', 'Domestic'],
    'population_estimate': [25000, 3200, 120, 50000000]
})

display(df_animals)
display(df_habitats)

,animal_id,name,type
0,1,Leo,Lion
1,2,Stripes,Tiger
2,3,Spot,Cheetah
3,4,Fluffy,Cat


,animal_id,habitat,population_estimate
0,1,Savannah,25000
1,2,Jungle,3200
2,5,Mountains,120
3,4,Domestic,50000000


**Your Prediction**
- inner: animal_id in both {1, 2, 4}
- left: all from animals {1, 2, 3, 4}
- right: all from habitats {1, 2, 4, 5}
- outer: union of both {1, 2, 3, 4, 5}
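The predictions above are really set operations on the animal_id values, so they can be written down before doing any merge. This is a small sketch using plain Python sets; the dictionary name predicted is illustrative.

```python
import pandas as pd

# Only the key column matters for predicting which rows survive a merge.
df_animals = pd.DataFrame({"animal_id": [1, 2, 3, 4]})
df_habitats = pd.DataFrame({"animal_id": [1, 2, 5, 4]})

animal_ids = set(df_animals["animal_id"])
habitat_ids = set(df_habitats["animal_id"])

predicted = {
    "inner": animal_ids & habitat_ids,  # keys present in both tables
    "left": animal_ids,                 # every key from the left table
    "right": habitat_ids,               # every key from the right table
    "outer": animal_ids | habitat_ids,  # keys present in either table
}

for how, ids in predicted.items():
    print(how, sorted(ids))
```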

In [7]:
## Your code
pd.merge(df_animals, df_habitats, on='animal_id', how='inner')

,animal_id,name,type,habitat,population_estimate
0,1,Leo,Lion,Savannah,25000
1,2,Stripes,Tiger,Jungle,3200
2,4,Fluffy,Cat,Domestic,50000000


In [8]:
## Repeat for left, right, outer
# Only the last expression in a cell is displayed, so the output below is the outer merge.
pd.merge(df_animals, df_habitats, on='animal_id', how='left')

pd.merge(df_animals, df_habitats, on='animal_id', how='right')

pd.merge(df_animals, df_habitats, on='animal_id', how='outer')


,animal_id,name,type,habitat,population_estimate
0,1,Leo,Lion,Savannah,25000.0
1,2,Stripes,Tiger,Jungle,3200.0
2,3,Spot,Cheetah,,
3,4,Fluffy,Cat,Domestic,50000000.0
4,5,,,Mountains,120.0
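To see all four merges at once instead of just the last one, one option is to loop over the merge types and print each result. The sketch below also passes indicator=True, which adds a _merge column saying whether each row came from the left table, the right table, or both. The dictionary name results is illustrative.

```python
import pandas as pd

df_animals = pd.DataFrame({
    "animal_id": [1, 2, 3, 4],
    "name": ["Leo", "Stripes", "Spot", "Fluffy"],
    "type": ["Lion", "Tiger", "Cheetah", "Cat"],
})
df_habitats = pd.DataFrame({
    "animal_id": [1, 2, 5, 4],
    "habitat": ["Savannah", "Jungle", "Mountains", "Domestic"],
    "population_estimate": [25000, 3200, 120, 50000000],
})

results = {}
for how in ["inner", "left", "right", "outer"]:
    # indicator=True adds a _merge column: left_only, right_only, or both.
    results[how] = pd.merge(
        df_animals, df_habitats, on="animal_id", how=how, indicator=True
    )
    print(f"--- how={how!r}: {len(results[how])} rows ---")
    print(results[how])
```

Wherever a merge leaves a gap in population_estimate, pandas has to store NaN there, which is why the integer column shows up as floats (25000.0) in the left and outer outputs.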


-------------------------------
## You Try

Do a pivot on your merged animal data. You can decide how to pivot, but try to say before running the code what you expect to happen.

**My Prediction:** If I pivot the merged animal data with type as the index, habitat as the columns, and population estimate as the values, I will get a table where each row is an animal type and each column is a habitat. Most cells will be NaN since each animal has only one habitat.

In [9]:
# Your code here
merged = pd.merge(df_animals, df_habitats, on='animal_id', how='outer')

pd.pivot_table(
    merged,
    index='type',
    columns='habitat',
    values='population_estimate'
)

habitat,Domestic,Jungle,Savannah
type,,,
Cat,50000000.0,,
Lion,,,25000.0
Tiger,,3200.0,
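The output has no Cheetah row and no Mountains column, even though both appear in the outer merge. pivot_table groups by type and habitat, and rows whose grouping key is missing are left out by default. As a hedged sketch of one way to keep them visible, the missing labels can be filled with a placeholder first; the label "Unknown" and the names labeled and table are my own choices, not part of the assignment.

```python
import pandas as pd

df_animals = pd.DataFrame({
    "animal_id": [1, 2, 3, 4],
    "name": ["Leo", "Stripes", "Spot", "Fluffy"],
    "type": ["Lion", "Tiger", "Cheetah", "Cat"],
})
df_habitats = pd.DataFrame({
    "animal_id": [1, 2, 5, 4],
    "habitat": ["Savannah", "Jungle", "Mountains", "Domestic"],
    "population_estimate": [25000, 3200, 120, 50000000],
})

merged = pd.merge(df_animals, df_habitats, on="animal_id", how="outer")

# Replace missing grouping labels with a placeholder so those rows are not
# silently dropped when pivot_table groups by type and habitat.
labeled = merged.fillna({"type": "Unknown", "habitat": "Unknown"})

# pivot_table aggregates with the mean by default; here each (type, habitat)
# pair has at most one row, so the mean is just that row's value.
table = pd.pivot_table(
    labeled,
    index="type",
    columns="habitat",
    values="population_estimate",
)
print(table)
```

The Mountains habitat now shows up under the placeholder type. The Cheetah may still be absent, since its only population estimate is itself missing and there is nothing to aggregate.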


---------------
## Data Wrangling - Day5 HW

## Homework 5

Using all three datasets below, we would like to determine if the usage patterns for users differ between different devices. See if you can ask some questions of your own. Here are some examples:

1. Does the platform being used impact the number of monthly mb used? 
2. Do users using Samsung devices use more call minutes than those using LGE devices? 

Idea from: https://www.kaggle.com/code/vin1234/merge-join-and-concat-with-pandas
Author: Vinay Vikram

- Looking at the data, which columns can be used for merging? Do you see any you might need/want to rename?
- Make sure to say what you are doing in the merge and why you are choosing the specific merge type.
- Explain in detail your approach to answering the question. There is more than one right answer!
    
------------------------------------

Your final notebooks should:

- [ ] Be a completely new notebook with just the Day5 stuff in it: Read in the data, merge it, answer a minimum of 3 questions. 
- [ ] Be reproducible with junk code removed.
- [ ] Have lots of language describing what you are doing, especially for questions you are asking or things that you find interesting about the data. Use complete sentences, nice headings, and good markdown formatting: https://www.markdownguide.org/cheat-sheet/
- [ ] Run without errors from start to finish.


In [11]:
user_usage=pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/user_usage.csv')
user_usage.head(10)

,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb,use_id
0,21.97,4.82,1557.33,22787
1,1710.08,136.88,7267.55,22788
2,1710.08,136.88,7267.55,22789
3,94.46,35.17,519.12,22790
4,71.59,79.26,1557.33,22792
5,71.59,79.26,1557.33,22793
6,71.59,79.26,519.12,22794
7,71.59,79.26,519.12,22795
8,30.92,22.77,3114.67,22799
9,69.8,14.7,25955.55,22801


In [12]:
user_device=pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/user_device.csv')
user_device.head(10)

,use_id,user_id,platform,platform_version,device,use_type_id
0,22782,26980,ios,10.2,"iPhone7,2",2
1,22783,29628,android,6.0,Nexus 5,3
2,22784,28473,android,5.1,SM-G903F,1
3,22785,15200,ios,10.2,"iPhone7,2",3
4,22786,28239,android,6.0,ONE E1003,1
5,22787,12921,android,4.3,GT-I9505,1
6,22788,28714,android,6.0,SM-G930F,1
7,22789,28714,android,6.0,SM-G930F,1
8,22790,29592,android,5.1,D2303,1
9,22791,28775,ios,10.2,"iPhone6,2",3


In [13]:
device=pd.read_csv('https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/android_devices.csv')
device.head(10)

,Retail Branding,Marketing Name,Device,Model
0,,,AD681H,Smartfren Andromax AD681H
1,,,FJL21,FJL21
2,,,T31,Panasonic T31
3,,,hws7721g,MediaPad 7 Youth 2
4,3Q,OC1020A,OC1020A,OC1020A
5,7Eleven,IN265,IN265,IN265
6,A.O.I. ELECTRONICS FACTORY,A.O.I.,TR10CS1_11,TR10CS1
7,AG Mobile,AG BOOST 2,BOOST2,E4010
8,AG Mobile,AG Flair,AG_Flair,Flair
9,AG Mobile,AG Go Tab Access 2,AG_Go_Tab_Access_2,AG_Go_Tab_Access_2
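Looking at the three previews, use_id appears in both user_usage and user_device, and the device codes in user_device look like the values in the Model column of the device table. Below is a minimal sketch of one possible merge plan, not the only right answer. To keep it runnable offline it uses tiny stand-in frames: the usage and device-code rows are copied from the previews above, but the three rows of the device lookup (and the shortened column name manufacturer) are illustrative assumptions and are not taken from the previews.

```python
import pandas as pd

# Stand-in rows copied from the user_usage and user_device previews above.
user_usage = pd.DataFrame({
    "outgoing_mins_per_month": [21.97, 1710.08, 1710.08, 94.46],
    "outgoing_sms_per_month": [4.82, 136.88, 136.88, 35.17],
    "monthly_mb": [1557.33, 7267.55, 7267.55, 519.12],
    "use_id": [22787, 22788, 22789, 22790],
})
user_device = pd.DataFrame({
    "use_id": [22782, 22787, 22788, 22789, 22790],
    "platform": ["ios", "android", "android", "android", "android"],
    "device": ["iPhone7,2", "GT-I9505", "SM-G930F", "SM-G930F", "D2303"],
})
# ASSUMPTION: illustrative lookup rows, not taken from the preview above.
device = pd.DataFrame({
    "Retail Branding": ["Samsung", "Samsung", "Sony"],
    "Model": ["GT-I9505", "SM-G930F", "D2303"],
})

# Step 1: a left merge keeps every usage record and attaches its device info.
usage_device = pd.merge(
    user_usage,
    user_device[["use_id", "platform", "device"]],
    on="use_id",
    how="left",
)

# Step 2: rename the brand column and keep one row per Model so the second
# left merge cannot duplicate usage rows.
brands = (
    device.rename(columns={"Retail Branding": "manufacturer"})
    [["manufacturer", "Model"]]
    .drop_duplicates(subset="Model")
)
full = pd.merge(usage_device, brands, left_on="device", right_on="Model", how="left")

# Example question: average monthly MB by manufacturer.
summary = full.groupby("manufacturer")["monthly_mb"].mean()
print(summary)
```

A left merge is used in both steps because the questions are about usage, so every usage row should survive even when its device cannot be matched to a brand. With the real files, the same two merges apply after reading the CSVs; checking len(full) against len(user_usage) is a quick way to confirm no rows were duplicated.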
