# Introduction
In this activity we will explore the data that you created from your class survey. For categorical data, we will summarize using `.value_counts()`, and for numeric data we will look at `.mean()`, `.median()` and `.describe()`.



# Header Block

We will always start by importing the libraries. This gives Python access to all of the functions that we will use. Every notebook will begin with the same block, and it will not change from activity to activity. Copy it and run it before anything else.

In [1]:
# This header will be the same no matter what code you are using
# import modules that we will use multiple functions from and give them short names

import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt


# import single functions

from scipy.stats.contingency import chi2_contingency
from itertools import combinations
from statsmodels.graphics.mosaicplot import mosaic
from scipy.stats import pearsonr

The data management block is where we get our data ready for analysis. This will include choosing variables, managing non-responses, collapsing variables and re-coding categorical variables. We don't have to do any of that with our survey, so the block is pretty small for now: all we have to do is import the data.


In [2]:
s23_survey_url = 'https://drive.google.com/uc?export=download&id=1F8L7t-nfY9LlY3_zVzp-wALwwkMT13CZ'

myData = pd.read_csv(s23_survey_url)

After you first import your data, it is helpful to look at only a few responses to get a sense of what is contained in the data. We can do this using the `.head()` function. The command below looks at the first 10 responses of my spring 2023 survey data.

In [5]:
myData.head(10)

Unnamed: 0,Timestamp,age,hs_name,interest_area,major,school_transport,ethnicity,music_service,prefered_pronouns,love_math,work,soda_choice,height_in,commute_dist_mi
0,1/23/23 15:49,18,Sequoia High School,"Art, Design & Performance Interest Area",Digital Art and Animation,"Automobile (Car, Truck, Van, etc..)",Hispanic,YouTube Music,He/Him,No,No,Coke,70,5.0
1,1/23/23 16:06,24,Woodside High School,Science and Health Interest Area,undeclared,"Automobile (Car, Truck, Van, etc..)",Hispanic,Apple Music,She/Her,No,No,Coke,63,6.5
2,1/23/23 18:17,24,Sequoia High School,"Art, Design & Performance Interest Area",Music,"Public Transport (Bus, Rail, Ferry, etc... )",Hispanic,Spotify,He/Him,Prefer not to say.,Yes,Coke,69,5.9
3,1/23/23 18:32,19,Half Moon Bay High School,"Art, Design & Performance Interest Area",Fashion design,"Automobile (Car, Truck, Van, etc..)",Black or African American,Apple Music,He/Him,No,Yes,Coke,5,16.0
4,1/23/23 23:45,19,Woodside High School,Human Behavior & Culture Interest Area,Elementary Education,"Public Transport (Bus, Rail, Ferry, etc... )",Hispanic,Spotify,any pronouns,No,No,Coke,69,4.0
5,1/24/23 11:47,19,Burlingame High School,Science and Health Interest Area,Nursing,"Automobile (Car, Truck, Van, etc..)",Multiple ethnicity/ Other,Apple Music,She/Her,Prefer not to say.,Yes,Coke,5,11.0
6,1/24/23 14:53,22,Hilo High School,Business Interest Area,Business Administration,"Automobile (Car, Truck, Van, etc..)",Multiple ethnicity/ Other,Spotify,He/Him,Yes,No,Coke,74,8.0
7,1/24/23 16:47,20,Woodside High School,None of the above.,Undecided,"Automobile (Car, Truck, Van, etc..)",Hispanic,YouTube Music,She/Her,Prefer not to say.,No,Coke,62,11.0
8,1/24/23 17:11,19,Sequoia High School,Business Interest Area,Business,"Automobile (Car, Truck, Van, etc..)",Hispanic,Spotify,She/Her,No,Yes,Coke,62,4.0
9,1/24/23 18:31,22,No.91,Human Behavior & Culture Interest Area,Psychology,"Automobile (Car, Truck, Van, etc..)",White / Caucasian,Spotify,He/Him,Yes,Yes,Pepsi,66,20.0


`.value_counts()` is the function that takes the responses for a particular variable and counts the number of responses for each. The result is what we will call a **frequency distribution**. This is particularly useful when dealing with categorical variables. Below I am taking the "interest_area" variable and summarizing the responses.


In [9]:
myData["interest_area"].value_counts()

Human Behavior & Culture Interest Area     9
Business Interest Area                     8
Science and Health Interest Area           7
Art, Design & Performance Interest Area    6
None of the above.                         3
Name: interest_area, dtype: int64

If we would like to look at the relative frequency of each response, we pass the `normalize=True` argument to the `.value_counts()` command. The relative frequency is the proportion of the responses that are in each category.

In [11]:
myData["interest_area"].value_counts(normalize=True)

Human Behavior & Culture Interest Area     0.272727
Business Interest Area                     0.242424
Science and Health Interest Area           0.212121
Art, Design & Performance Interest Area    0.181818
None of the above.                         0.090909
Name: interest_area, dtype: float64
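To see where the relative frequencies come from, here is a small sketch that computes them by hand. The `toy` data frame is made up for illustration and is not the class survey; only the `.value_counts()` calls match what we used above.

```python
import pandas as pd

# Hypothetical mini-survey with five responses (not the class data)
toy = pd.DataFrame({"soda_choice": ["Coke", "Coke", "Pepsi", "Coke", "Sprite"]})

# Frequency distribution: how many times each response appears
counts = toy["soda_choice"].value_counts()

# Relative frequency distribution computed by pandas
rel_freq = toy["soda_choice"].value_counts(normalize=True)

# The same thing by hand: each count divided by the total number of responses
by_hand = counts / counts.sum()

print(rel_freq)
print(by_hand)
```

Both print-outs should agree, and the relative frequencies should add up to 1.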

We can make these proportions into percentages by multiplying by 100. 

In [12]:
myData["hs_name"].value_counts(normalize=True)*100

Woodside High School               27.272727
Sequoia High School                21.212121
Menlo Atherton High School         12.121212
Half Moon Bay High School           3.030303
Burlingame High School              3.030303
Hilo High School                    3.030303
No.91                               3.030303
John O' Connell High School         3.030303
Santa Teresa High School            3.030303
Piner High School                   3.030303
Athens Drive High School            3.030303
Los Altos High School               3.030303
Mid-Pen High School                 3.030303
Junipero Serra High School          3.030303
Mercy Burlingame High School        3.030303
South San Francisco High School     3.030303
Name: hs_name, dtype: float64

Note that this just moves the decimal point two places to the right. As you get more comfortable, this will become a natural conversion.
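Percentages with six decimal places can be hard to read. One option, sketched below on made-up data, is to round the result with `.round()`. The `toy` data frame and the choice of one decimal place are assumptions for this example.

```python
import pandas as pd

# Hypothetical responses (not the class data)
toy = pd.DataFrame({"hs_name": ["Woodside", "Woodside", "Sequoia"]})

# Proportions -> percentages -> rounded to one decimal place
pct = (toy["hs_name"].value_counts(normalize=True) * 100).round(1)

print(pct)
```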

`.describe()` is a function that tries to create a numeric summary of the variable. For categorical variables, it gives you the sample size, how many unique responses there are, and the most frequently occurring response along with how often it occurs.

In [13]:
myData["hs_name"].describe()

count                       33
unique                      16
top       Woodside High School
freq                         9
Name: hs_name, dtype: object
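Each row of that summary can also be computed on its own. The sketch below uses a made-up list of responses, including one missing response, to show that `count` only counts the responses that were actually given.

```python
import pandas as pd

# Hypothetical responses; None stands for a student who skipped the question
hs = pd.Series(["Woodside", "Woodside", "Sequoia", None], name="hs_name")

summary = hs.describe()

n_responses = hs.count()      # number of non-missing responses
n_unique = hs.nunique()       # number of different responses
most_common = hs.mode()[0]    # most frequently occurring response

print(summary)
print(n_responses, n_unique, most_common)
```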

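For a numeric variable the output is different: it reports the sample size, the mean, the standard deviation, the minimum, the quartiles and the maximum.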

In [14]:
myData['age'].describe()

count    33.000000
mean     21.000000
std       5.006246
min      18.000000
25%      19.000000
50%      19.000000
75%      22.000000
max      45.000000
Name: age, dtype: float64

For a numeric variable, we can find the average by using the `.mean()` command. Here we find the average age.

In [None]:
myData["age"].mean()

21.0
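The `.median()` command works the same way and gives the middle value, which is the `50%` row in the `.describe()` output. A minimal sketch on made-up ages (not the class data) shows how one large value pulls the mean up while the median stays put:

```python
import pandas as pd

# Hypothetical ages; the 45 is much larger than the rest
ages = pd.Series([18, 19, 19, 22, 45], name="age")

mean_age = ages.mean()                 # sum of the ages divided by how many there are
median_age = ages.median()             # middle value once the ages are sorted
describe_median = ages.describe()["50%"]

print(mean_age, median_age, describe_median)
```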

One thing that categorical variables are good for is grouping our responses so that we can make comparisons. We can use the `.groupby()` function to group the responses by a certain condition. The command below groups the responses by interest area and then computes the average age for the students in each group.

In [15]:
myData.groupby("interest_area")["age"].mean()

interest_area
Art, Design & Performance Interest Area    19.333333
Business Interest Area                     23.625000
Human Behavior & Culture Interest Area     20.000000
None of the above.                         21.333333
Science and Health Interest Area           20.571429
Name: age, dtype: float64
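If we want more than one summary per group, one option is to follow `.groupby()` with `.agg()` and a list of summary names. The sketch below uses a made-up data frame, so the numbers are for illustration only.

```python
import pandas as pd

# Hypothetical responses (not the class data)
toy = pd.DataFrame({
    "interest_area": ["Business", "Business", "Art", "Art", "Art"],
    "age": [20, 30, 18, 19, 20],
})

# Group size, average age and median age for each interest area
by_area = toy.groupby("interest_area")["age"].agg(["count", "mean", "median"])

print(by_area)
```

Looking at the group size next to the mean is a useful habit, because an average from only two or three students can be swayed a lot by a single response.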

Now take those commands and adapt them to your first coding activity. Good luck! 