# IS362 - Project Two

Choose any three of the “wide” datasets identified in the Week 5 Discussion items. For each of the three chosen datasets:
• Create a .CSV file (or optionally, a MySQL database!) that includes all of the information
included in the dataset. You’re encouraged to use a “wide” structure similar to how the
information appears in the discussion item, so that you can practice tidying and
transformations as described below.
• Read the information from your .CSV file into Python, and use pandas as needed to tidy
and transform your data. [Most of your grade will be based on this step!]
• Perform the analysis requested in the discussion item.
• Your code should be in a Jupyter Notebook, posted to your GitHub repository, and
should include narrative descriptions of your data cleanup work, analysis, and
conclusions. 

The three data sets which I have selected for this project are the California Cities data set, the NYC Cause of Death data set, and the Student data set. I have added all three data sets to the repository.

Let's start with the imports we'll need to clean and analyze this data:

In [1]:
import pandas as pd
import numpy as np

# California Cities Data Analysis

Let's start with the California Cities data set. It describes cities in the state of California: each city's name, coordinates, elevation, population, total area, and the split between land area and water area.

In [2]:
calidata = pd.read_csv("C:/Users/Jessica/Desktop/John's School/IS362/IS362_Project2/california_cities_data.csv")
calidata

Unnamed: 0.1,Unnamed: 0,city,latd,longd,elevation_m,elevation_ft,population_total,area_total_sq_mi,area_land_sq_mi,area_water_sq_mi,area_total_km2,area_land_km2,area_water_km2,area_water_percent
0,0,Adelanto,34.576111,-117.432778,875.0,2871.0,31765,56.027,56.009,0.018,145.107,145.062,0.046,0.03
1,1,AgouraHills,34.153333,-118.761667,281.0,922.0,20330,7.822,7.793,0.029,20.260,20.184,0.076,0.37
2,2,Alameda,37.756111,-122.274444,,33.0,75467,22.960,10.611,12.349,59.465,27.482,31.983,53.79
3,3,Albany,37.886944,-122.297778,,43.0,18969,5.465,1.788,3.677,14.155,4.632,9.524,67.28
4,4,Alhambra,34.081944,-118.135000,150.0,492.0,83089,7.632,7.631,0.001,19.766,19.763,0.003,0.01
5,5,AlisoViejo,33.575000,-117.725556,127.0,417.0,47823,7.472,7.472,0.000,19.352,19.352,0.000,0.00
6,6,Alturas,41.487222,-120.542500,1332.0,4370.0,2827,2.449,2.435,0.014,6.342,6.306,0.036,0.57
7,7,AmadorCity,38.419444,-120.824167,280.0,919.0,185,0.314,0.314,0.000,0.813,0.813,0.000,0.00
8,8,AmericanCanyon,38.168056,-122.252500,14.0,46.0,19454,4.845,4.837,0.008,12.548,12.527,0.021,0.17
9,9,Anaheim,33.836111,-117.889722,48.0,157.0,336000,50.811,49.835,0.976,131.600,129.073,2.527,1.92


Although this data set is not poorly organized, it has many null and zero values, so any analysis will need to account for them. The first column also has no name or explanation. Its values appear to be unique identifiers for the cities, so let's start by giving that column a descriptive name.

In [3]:
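# Rename the unnamed first column; rename() avoids mutating the column index in place.
calidata = calidata.rename(columns={calidata.columns[0]: 'city_id'})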
calidata.head()

Unnamed: 0,city_id,city,latd,longd,elevation_m,elevation_ft,population_total,area_total_sq_mi,area_land_sq_mi,area_water_sq_mi,area_total_km2,area_land_km2,area_water_km2,area_water_percent
0,0,Adelanto,34.576111,-117.432778,875.0,2871.0,31765,56.027,56.009,0.018,145.107,145.062,0.046,0.03
1,1,AgouraHills,34.153333,-118.761667,281.0,922.0,20330,7.822,7.793,0.029,20.26,20.184,0.076,0.37
2,2,Alameda,37.756111,-122.274444,,33.0,75467,22.96,10.611,12.349,59.465,27.482,31.983,53.79
3,3,Albany,37.886944,-122.297778,,43.0,18969,5.465,1.788,3.677,14.155,4.632,9.524,67.28
4,4,Alhambra,34.081944,-118.135,150.0,492.0,83089,7.632,7.631,0.001,19.766,19.763,0.003,0.01


Let's ask a few questions about the data:
• What city has the highest and lowest elevation? 
• What city has the highest and lowest population?
• What city has the greatest total area?

We'll start by looking at the cities with the highest and lowest elevation. We'll do this by sorting the data according to the elevation column. Although the data set provides the elevation in both meters and feet, we're going to use feet because that column has fewer null values. 
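
The claim that the feet column has fewer null values can be checked directly. The following is a minimal sketch using a small stand-in frame (`sample`, a name introduced here) built from the first five rows displayed earlier; on the real data the same call would presumably be made on `calidata`.

```python
import numpy as np
import pandas as pd

# Stand-in for calidata: the first five rows displayed earlier.
sample = pd.DataFrame({
    "city": ["Adelanto", "AgouraHills", "Alameda", "Albany", "Alhambra"],
    "elevation_m": [875.0, 281.0, np.nan, np.nan, 150.0],
    "elevation_ft": [2871.0, 922.0, 33.0, 43.0, 492.0],
})

# Count the missing values in each elevation column.
null_counts = sample[["elevation_m", "elevation_ft"]].isna().sum()
print(null_counts)
```

In these five rows, `elevation_m` is missing twice and `elevation_ft` is never missing, which matches the reasoning for choosing feet.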

Here are the five cities with the highest elevation:

In [4]:
calidata[["city", "elevation_ft"]].sort_values(by=["elevation_ft"], ascending=False).head()

Unnamed: 0,city,elevation_ft
246,MammothLakes,7880.0
39,BigBearLake,6752.0
416,SouthLakeTahoe,6237.0
436,Truckee,5817.0
242,Loyalton,4951.0


Here are the five cities with the lowest elevation: 

In [5]:
calidata[["city", "elevation_ft"]].sort_values(by=["elevation_ft"], ascending=True).head()

Unnamed: 0,city,elevation_ft
57,Calipatria,-180.0
463,Westmorland,-164.0
45,Brawley,-112.0
81,Coachella,-66.0
187,Imperial,-59.0
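
The two sorts above could also be written with `nlargest` and `nsmallest`, which return the top or bottom rows without sorting the whole frame first. This is only a sketch on a stand-in frame (`sample`) built from a few of the elevations shown in the outputs above.

```python
import pandas as pd

# Stand-in for calidata, using elevations from the outputs above.
sample = pd.DataFrame({
    "city": ["MammothLakes", "BigBearLake", "Alameda", "Westmorland", "Calipatria"],
    "elevation_ft": [7880.0, 6752.0, 33.0, -164.0, -180.0],
})

# Two highest and two lowest cities by elevation in feet.
highest = sample.nlargest(2, "elevation_ft")[["city", "elevation_ft"]]
lowest = sample.nsmallest(2, "elevation_ft")[["city", "elevation_ft"]]

print(highest)
print(lowest)
```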


Let's move on to the cities with the highest and lowest population. We'll do this in the same way we analyzed the elevation of the cities, by sorting the data according to the population column. 

Here are the five cities with the highest population:

In [6]:
calidata[["city", "population_total"]].sort_values(by=["population_total"], ascending=False).head()

Unnamed: 0,city,population_total
239,LosAngeles,3884307
367,SanDiego,1345895
375,SanJose,1000536
370,SanFrancisco,837442
150,Fresno,509039


Here are the five cities with the lowest population:

In [7]:
calidata[["city", "population_total"]].sort_values(by=["population_total"], ascending=True).head()

Unnamed: 0,city,population_total
327,Pomona,1
448,Vernon,112
7,AmadorCity,185
190,Industry,219
366,SandCity,334


But something doesn't seem quite right here. Is it really possible that the city of Pomona, California has a population of 1? According to Google, the population of Pomona, CA is 151,348. That's a pretty big difference, so let's correct that value and then reexamine the data.

In [8]:
calidata.loc[(calidata.city== "Pomona"), "population_total"] = 151348

In [9]:
calidata[["city", "population_total"]].sort_values(by=["population_total"], ascending=True).head()

Unnamed: 0,city,population_total
448,Vernon,112
7,AmadorCity,185
190,Industry,219
366,SandCity,334
435,Trinidad,367


How about Vernon, California? According to Google, its population was 114 as of 2013. Since the data set does not say which year its figures come from, it's reasonable to presume that the value of 112, though possibly slightly outdated, is correct.
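
A check like the one done by eye above could be made explicit by flagging populations below some cutoff before deciding which ones to correct. The sketch below assumes a cutoff of 50 (`MIN_PLAUSIBLE_POPULATION`), which is a hypothetical value chosen only for illustration, and uses a stand-in frame holding the five lowest populations shown before the correction.

```python
import pandas as pd

# Stand-in for calidata: the five lowest populations shown before the correction.
sample = pd.DataFrame({
    "city": ["Pomona", "Vernon", "AmadorCity", "Industry", "SandCity"],
    "population_total": [1, 112, 185, 219, 334],
})

MIN_PLAUSIBLE_POPULATION = 50  # hypothetical cutoff, for illustration only

# Flag rows whose population falls below the cutoff.
suspicious = sample[sample["population_total"] < MIN_PLAUSIBLE_POPULATION]
print(suspicious)

# Apply the same correction used above, then confirm nothing is still flagged.
sample.loc[sample["city"] == "Pomona", "population_total"] = 151348
remaining = sample[sample["population_total"] < MIN_PLAUSIBLE_POPULATION]
print(remaining.empty)
```

A fixed cutoff is a blunt tool: small cities such as Vernon are legitimately tiny, so any flagged row still needs to be checked against an outside source.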

Finally, let's examine the cities with the greatest total area:

In [10]:
calidata[["city", "area_total_sq_mi"]].sort_values(by=["area_total_sq_mi"], ascending=False).head()

Unnamed: 0,city,area_total_sq_mi
239,LosAngeles,503.0
367,SanDiego,372.4
370,SanFrancisco,231.89
55,CaliforniaCity,203.631
375,SanJose,179.97


A quick Google search confirms that these figures are correct.

# NYC Cause of Death Data Set

Let's move on to the NYC Cause of Death data set. This data set contains information about the leading causes of death in New York City, broken down by year, sex, and race/ethnicity.

In [11]:
deathdata = pd.read_csv("C:/Users/Jessica/Desktop/John's School/IS362/IS362_Project2/nyc_death_data.csv")
deathdata

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
0,2014,Diabetes Mellitus (E10-E14),F,Other Race/ Ethnicity,11,.,.
1,2011,Cerebrovascular Disease (Stroke: I60-I69),M,White Non-Hispanic,290,21.7,18.2
2,2008,Malignant Neoplasms (Cancer: C00-C97),M,Not Stated/Unknown,60,.,.
3,2010,Malignant Neoplasms (Cancer: C00-C97),F,Hispanic,1045,85.9,98.5
4,2012,Cerebrovascular Disease (Stroke: I60-I69),M,Black Non-Hispanic,170,19.9,23.3
5,2007,Mental and Behavioral Disorders due to Use of ...,M,Not Stated/Unknown,.,.,.
6,2011,All Other Causes,F,Not Stated/Unknown,14,.,.
7,2007,Chronic Lower Respiratory Diseases (J40-J47),F,Black Non-Hispanic,163,15.5,14.8
8,2012,Essential Hypertension and Renal Diseases (I10...,F,Hispanic,101,8.2,9.5
9,2009,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Hispanic,1382,123.1,227.9


As we can see, this data has a lot of null values. Unfortunately, instead of being left blank, those fields contain a ".". First we will replace those characters with NaN so that the null values are represented correctly.

In [32]:
deathdata = deathdata.replace('.', np.nan)

In my initial analysis of the data, I noticed that the values in the Deaths column were sorting in a strange way: as text rather than as numbers (in a text sort, for example, "4000" comes before "900"). I realized this is because the column was read in as strings rather than numbers, so let's convert it.

In [33]:
# Convert the Deaths column from strings to numbers (NaN values are preserved).
deathdata['Deaths'] = pd.to_numeric(deathdata['Deaths'])
deathdata.dtypes

Year                         int64
Leading Cause               object
Sex                         object
Race Ethnicity              object
Deaths                     float64
Death Rate                  object
Age Adjusted Death Rate     object
dtype: object
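
As an alternative to the replace-and-convert steps above, the "." placeholders could be declared as missing values when the file is read, using the `na_values` parameter of `pd.read_csv`. The sketch below reads a small in-memory stand-in (`csv_text`) made of rows copied from the display above rather than the real file, so the path and full contents are not assumed.

```python
import io
import pandas as pd

# Stand-in for nyc_death_data.csv: rows copied from the display above.
# One cause label is truncated exactly as it appears in that display.
csv_text = """Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
2014,Diabetes Mellitus (E10-E14),F,Other Race/ Ethnicity,11,.,.
2011,Cerebrovascular Disease (Stroke: I60-I69),M,White Non-Hispanic,290,21.7,18.2
2008,Malignant Neoplasms (Cancer: C00-C97),M,Not Stated/Unknown,60,.,.
2007,Mental and Behavioral Disorders due to Use of ...,M,Not Stated/Unknown,.,.,.
"""

# Treat a lone "." as missing while parsing, so numeric columns stay numeric.
sample = pd.read_csv(io.StringIO(csv_text), na_values=["."])
print(sample.dtypes)
```

Read this way, all three numeric columns (Deaths, Death Rate, and Age Adjusted Death Rate) come in as floats, so no separate conversion step is needed.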

Now let's ask some questions about this data:
• What was the leading cause of death for each demographic in 2007?
• What was the leading cause of death for each demographic in 2014?

Leading cause of death for Asian and Pacific Islanders in 2007:

In [34]:
deathdata[(deathdata['Year']==2007) & (deathdata['Race Ethnicity']=='Asian and Pacific Islander')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
781,2007,Malignant Neoplasms (Cancer: C00-C97),M,Asian and Pacific Islander,528.0,109.1,145.9
364,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Asian and Pacific Islander,496.0,102.5,157.8
404,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Asian and Pacific Islander,428.0,83.1,115.2
741,2007,Malignant Neoplasms (Cancer: C00-C97),F,Asian and Pacific Islander,395.0,76.7,88.4
483,2007,All Other Causes,M,Asian and Pacific Islander,221.0,45.7,61.2


Leading cause of death for Black Non-Hispanic in 2007:

In [35]:
deathdata[(deathdata['Year']==2007) & (deathdata['Race Ethnicity']=='Black Non-Hispanic')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
447,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Black Non-Hispanic,2722.0,258.6,247.3
65,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Black Non-Hispanic,2121.0,248.9,340.2
461,2007,Malignant Neoplasms (Cancer: C00-C97),F,Black Non-Hispanic,1800.0,171.0,161.7
811,2007,Malignant Neoplasms (Cancer: C00-C97),M,Black Non-Hispanic,1523.0,178.7,229.4
715,2007,All Other Causes,F,Black Non-Hispanic,1230.0,116.8,113.4


Leading cause of death for Hispanics in 2007:

In [36]:
deathdata[(deathdata['Year']==2007) & (deathdata['Race Ethnicity']=='Hispanic')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
356,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Hispanic,1418.0,121.5,164.3
837,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Hispanic,1327.0,121.7,238.4
1066,2007,Malignant Neoplasms (Cancer: C00-C97),M,Hispanic,1013.0,92.9,163.5
235,2007,All Other Causes,M,Hispanic,988.0,90.6,142.6
366,2007,Malignant Neoplasms (Cancer: C00-C97),F,Hispanic,969.0,83.0,100.7


Leading cause of death for the Not Stated/Unknown group in 2007:

In [37]:
deathdata[(deathdata['Year']==2007) & (deathdata['Race Ethnicity']=='Not Stated/Unknown')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
618,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Not Stated/Unknown,86.0,,
878,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Not Stated/Unknown,82.0,,
769,2007,Malignant Neoplasms (Cancer: C00-C97),F,Not Stated/Unknown,51.0,,
40,2007,Malignant Neoplasms (Cancer: C00-C97),M,Not Stated/Unknown,45.0,,
716,2007,All Other Causes,M,Not Stated/Unknown,43.0,,


Leading cause of death for the Other Race/Ethnicity group in 2007:

In [38]:
deathdata[(deathdata['Year']==2007) & (deathdata['Race Ethnicity']=='Other Race/ Ethnicity')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
637,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Other Race/ Ethnicity,43.0,,
830,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Other Race/ Ethnicity,36.0,,
686,2007,Malignant Neoplasms (Cancer: C00-C97),M,Other Race/ Ethnicity,29.0,,
1086,2007,All Other Causes,M,Other Race/ Ethnicity,24.0,,
255,2007,Malignant Neoplasms (Cancer: C00-C97),F,Other Race/ Ethnicity,22.0,,


Leading cause of death for White Non-Hispanic in 2007:

In [39]:
deathdata[(deathdata['Year']==2007) & (deathdata['Race Ethnicity']=='White Non-Hispanic')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
833,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,White Non-Hispanic,7050.0,491.4,250.7
657,2007,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,White Non-Hispanic,5632.0,421.0,350.7
476,2007,Malignant Neoplasms (Cancer: C00-C97),F,White Non-Hispanic,3518.0,245.2,167.4
79,2007,Malignant Neoplasms (Cancer: C00-C97),M,White Non-Hispanic,3356.0,250.9,213.7
272,2007,All Other Causes,M,White Non-Hispanic,1749.0,130.7,115.9


Let's examine if and how these numbers changed by 2014:

Leading cause of death for Asian and Pacific Islanders in 2014:

In [40]:
deathdata[(deathdata['Year']==2014) & (deathdata['Race Ethnicity']=='Asian and Pacific Islander')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
210,2014,Malignant Neoplasms (Cancer: C00-C97),M,Asian and Pacific Islander,657.0,114.5,129.5
988,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Asian and Pacific Islander,554.0,96.5,118.5
344,2014,Malignant Neoplasms (Cancer: C00-C97),F,Asian and Pacific Islander,502.0,80.2,80.6
567,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Asian and Pacific Islander,462.0,73.8,81.1
54,2014,All Other Causes,M,Asian and Pacific Islander,424.0,73.9,90.4


Leading cause of death for Black Non-Hispanic in 2014:

In [41]:
deathdata[(deathdata['Year']==2014) & (deathdata['Race Ethnicity']=='Black Non-Hispanic')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
819,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Black Non-Hispanic,2194.0,209.1,169.1
303,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Black Non-Hispanic,1958.0,226.8,264.7
231,2014,Malignant Neoplasms (Cancer: C00-C97),F,Black Non-Hispanic,1852.0,176.5,148.4
810,2014,All Other Causes,F,Black Non-Hispanic,1536.0,146.4,126.4
278,2014,Malignant Neoplasms (Cancer: C00-C97),M,Black Non-Hispanic,1532.0,177.5,199.6


Leading cause of death for Hispanics in 2014:

In [42]:
deathdata[(deathdata['Year']==2014) & (deathdata['Race Ethnicity']=='Hispanic')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
517,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Hispanic,1281.0,107.3,170.5
908,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Hispanic,1230.0,97.1,106.7
304,2014,All Other Causes,M,Hispanic,1195.0,100.1,143.3
526,2014,Malignant Neoplasms (Cancer: C00-C97),F,Hispanic,1154.0,91.1,97.4
271,2014,Malignant Neoplasms (Cancer: C00-C97),M,Hispanic,1146.0,96.0,143.5


Leading cause of death for the Not Stated/Unknown group in 2014:

In [43]:
deathdata[(deathdata['Year']==2014) & (deathdata['Race Ethnicity']=='Not Stated/Unknown')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
277,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Not Stated/Unknown,115.0,,
127,2014,All Other Causes,M,Not Stated/Unknown,96.0,,
603,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Not Stated/Unknown,95.0,,
355,2014,Malignant Neoplasms (Cancer: C00-C97),M,Not Stated/Unknown,73.0,,
314,2014,Malignant Neoplasms (Cancer: C00-C97),F,Not Stated/Unknown,50.0,,


Leading cause of death for the Other Race/Ethnicity group in 2014:

In [44]:
deathdata[(deathdata['Year']==2014) & (deathdata['Race Ethnicity']=='Other Race/ Ethnicity')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
264,2014,All Other Causes,M,Other Race/ Ethnicity,74.0,,
612,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,Other Race/ Ethnicity,68.0,,
26,2014,Malignant Neoplasms (Cancer: C00-C97),F,Other Race/ Ethnicity,66.0,,
898,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,Other Race/ Ethnicity,63.0,,
251,2014,All Other Causes,F,Other Race/ Ethnicity,59.0,,


Leading cause of death for White Non-Hispanic in 2014:

In [45]:
deathdata[(deathdata['Year']==2014) & (deathdata['Race Ethnicity']=='White Non-Hispanic')].sort_values('Deaths', ascending=False).head()

Unnamed: 0,Year,Leading Cause,Sex,Race Ethnicity,Deaths,Death Rate,Age Adjusted Death Rate
620,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",F,White Non-Hispanic,4507.0,318.0,161.0
633,2014,"Diseases of Heart (I00-I09, I11, I13, I20-I51)",M,White Non-Hispanic,3990.0,297.1,238.4
953,2014,Malignant Neoplasms (Cancer: C00-C97),F,White Non-Hispanic,3153.0,222.5,150.2
607,2014,Malignant Neoplasms (Cancer: C00-C97),M,White Non-Hispanic,3142.0,234.0,195.1
726,2014,All Other Causes,F,White Non-Hispanic,2578.0,181.9,110.7


# International Student Data

This data set has a lot of information about students from several countries, including their country and region, gender, age, dominant hand, height, foot length, and arm span, how many languages they speak, how they travel to school and how long that takes, their reaction time, their favorite physical activity, and how much importance they place on several issues.

In [46]:
studentdata = pd.read_csv("C:/Users/Jessica/Desktop/John's School/IS362/IS362_Project2/student_data.csv", encoding='latin-1')
studentdata

Unnamed: 0,Country,Region,Gender,Age-years,Handed,Height_cm,Footlength_cm,Armspan_cm,Languages_spoken,Travel_to_School,Travel_time_to_School,Reaction_time,Score_in_memory_game,Favourite_physical_activity,Importance_reducing_pollution,Importance_recycling_rubbish,Importance_conserving_water,Importance_saving_enery,Importance_owning_computer,Importance_Internet_access
0,Australia,New South Wales,Female,10,Right-handed,0.0,0.00,0.0,2.0,Car,10.0,0.660,56.0,Netball,1000.0,947.0,1000.0,,1000.0,947.0
1,Australia,Victoria,Female,14,Ambidextrous,174.0,25.00,176.0,1.0,Car,15.0,0.420,31.0,Netball,220.0,509.0,610.0,214.0,1000.0,1000.0
2,New Zealand,Auckland,Female,14,Left-handed,170.0,26.00,160.0,2.0,Bus,25.0,0.375,,Athletics,1000.0,,1000.0,,,
3,UK,South West,Female,15,Right-handed,163.0,22.00,154.0,1.0,Walk,12.0,0.330,41.0,Football/Soccer,871.0,998.0,1000.0,1000.0,871.0,869.0
4,Australia,New South Wales,Female,16,Right-handed,171.0,22.00,162.0,2.0,Bus,40.0,2.470,31.0,Golf,1000.0,941.0,941.0,911.0,1000.0,1000.0
5,Australia,Western Australia,Male,11,Right-handed,146.0,22.00,138.0,1.0,Car,15.0,0.690,36.0,Rugby Union,7.0,111.0,25.0,41.0,539.0,151.0
6,Canada,Alberta,Female,12,Right-handed,156.5,21.00,161.0,1.0,Car,15.0,0.341,37.0,Swimming,824.0,278.0,91.0,219.0,941.0,968.0
7,Australia,Queensland,Male,15,Right-handed,184.0,30.00,184.0,1.0,Bus,25.0,0.510,27.0,,,,976.0,,982.0,1000.0
8,Canada,Ontario,Female,14,Right-handed,160.0,22.00,166.5,2.0,Bus,35.0,0.391,31.0,Bowling,1000.0,1000.0,1000.0,1000.0,503.0,599.0
9,UK,West Midlands,Male,13,Right-handed,171.0,26.25,171.0,1.0,Walk,20.0,14.080,51.0,Football/Soccer,365.0,723.0,756.0,711.0,1000.0,1000.0


There are two problems with this data. One is that some of the values appear to be missing or incorrect. The other is that the data set contains so much information that it is difficult to focus on specifics. To narrow things down, let's ask a few targeted questions:
• Are more students right handed or left handed?
• What is the most popular sport among students of each country?
• What is the average time it takes a student to get to school in each country?
• In which country are students most concerned about reducing pollution?
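
One way the missing or incorrect values mentioned above could be handled is to treat impossible measurements as missing. The sketch below assumes that a height, foot length, or arm span of zero means the value was never recorded; that is an assumption, not something the data set states. It uses a stand-in frame built from the first four rows displayed above.

```python
import numpy as np
import pandas as pd

# Stand-in for studentdata: the first four rows displayed above.
sample = pd.DataFrame({
    "Country": ["Australia", "Australia", "New Zealand", "UK"],
    "Height_cm": [0.0, 174.0, 170.0, 163.0],
    "Footlength_cm": [0.0, 25.0, 26.0, 22.0],
    "Armspan_cm": [0.0, 176.0, 160.0, 154.0],
})

# Assumption: a body measurement of 0 is a missing entry, not a real value.
body_cols = ["Height_cm", "Footlength_cm", "Armspan_cm"]
cleaned = sample.copy()
cleaned[body_cols] = cleaned[body_cols].replace(0, np.nan)

print(sample["Height_cm"].mean())   # mean with the zero included
print(cleaned["Height_cm"].mean())  # mean with the zero treated as missing
```

Because pandas skips NaN when averaging, the zero no longer drags the mean height down once it is marked as missing.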

Are more students right handed or left handed?

In [47]:
studentdata.groupby('Country')['Handed'].value_counts()

Country      Handed      
Australia    Right-handed    141
             Left-handed      16
             Ambidextrous     13
Canada       Right-handed     88
             Ambidextrous      8
             Left-handed       4
New Zealand  Right-handed     61
             Left-handed       9
             Ambidextrous      4
UK           Right-handed     46
             Left-handed       9
             Ambidextrous      1
USA          Right-handed     89
             Left-handed       7
             Ambidextrous      3
Name: Handed, dtype: int64

In all five of the countries surveyed, there are more right-handed students than left-handed students.
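
Since the countries have different numbers of respondents, shares can be easier to compare than raw counts. The sketch below rebuilds the counts from the output above into a small table (`counts`, a name introduced here) and computes both the overall totals and the share of each handedness within each country.

```python
import pandas as pd

# Counts copied from the value_counts output above.
counts = pd.DataFrame(
    {
        "Right-handed": [141, 88, 61, 46, 89],
        "Left-handed": [16, 4, 9, 9, 7],
        "Ambidextrous": [13, 8, 4, 1, 3],
    },
    index=["Australia", "Canada", "New Zealand", "UK", "USA"],
)

# Totals across all five countries.
overall = counts.sum()

# Share of each handedness within each country (each row sums to 1).
shares = counts.div(counts.sum(axis=1), axis=0)

print(overall)
print(shares.round(3))
```

Across all five countries the counts above add up to 425 right-handed, 45 left-handed, and 29 ambidextrous students.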

What is the most popular sport among students of each country?

In [48]:
studentdata.groupby('Country')['Favourite_physical_activity'].value_counts()

Country    Favourite_physical_activity
Australia  Football/Soccer                33
           Netball                        22
           Other activities/sports        22
           Dancing                        14
           Basketball                     10
           Hockey (Field)                 10
           Tennis                          9
           Rugby Union                     6
           Cricket                         5
           Swimming                        5
           Gymnastics                      4
           Martial arts                    4
           Rugby League                    4
           Athletics                       3
           Baseball/Softball               3
           Cycling                         3
           Table Tennis                    3
           Walking/Hiking                  3
           Skateboarding/Rollerblading     2
           Bowling                         1
           Golf                            1
           Hocke
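
Because the full listing is long, it may help to keep only the top activity for each country. Grouped `value_counts` sorts by frequency within each group, so taking the first row per country gives the most common activity. The frame below is a hypothetical stand-in with made-up rows for illustration only, not a sample of the real data.

```python
import pandas as pd

# Hypothetical stand-in for studentdata; these rows are illustrative only.
sample = pd.DataFrame({
    "Country": ["Australia", "Australia", "Australia", "UK", "UK", "UK"],
    "Favourite_physical_activity": [
        "Netball", "Netball", "Golf",
        "Football/Soccer", "Football/Soccer", "Tennis",
    ],
})

# Count activities per country; counts are sorted descending within each country.
counts = sample.groupby("Country")["Favourite_physical_activity"].value_counts()

# Keep only the first (most frequent) activity for each country.
top_per_country = counts.groupby(level="Country").head(1)
print(top_per_country)
```

If two activities tie for first place in a country, this keeps only one of them, so ties would need a separate check.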

What is the average time it takes a student to get to school in each country?

In [49]:
studentdata.groupby('Country')['Travel_time_to_School'].mean().sort_values(ascending=False)

Country
UK             19.964286
New Zealand    17.297297
Canada         16.950000
USA            16.885417
Australia      15.923529
Name: Travel_time_to_School, dtype: float64

Students in the UK have the longest average commute to school, at nearly 20 minutes. Students in Australia have the shortest, at just under 16 minutes.
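
A mean can be pulled around by a few unusually long commutes, so it may be worth reporting the median and the number of responses alongside it. The sketch below uses only the ten rows displayed at the top of this section as a stand-in, so its numbers are not representative of the full data set.

```python
import pandas as pd

# Stand-in for studentdata: the ten rows displayed at the top of this section.
sample = pd.DataFrame({
    "Country": ["Australia", "Australia", "New Zealand", "UK", "Australia",
                "Australia", "Canada", "Australia", "Canada", "UK"],
    "Travel_time_to_School": [10.0, 15.0, 25.0, 12.0, 40.0,
                              15.0, 15.0, 25.0, 35.0, 20.0],
})

# Mean, median, and number of responses per country, longest mean first.
summary = (
    sample.groupby("Country")["Travel_time_to_School"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

In these ten rows, for example, the five Australian students have a mean of 21 minutes but a median of 15, because one 40-minute commute raises the average.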

In which country are students most concerned about reducing pollution?

In [50]:
studentdata.groupby('Country')['Importance_reducing_pollution'].mean().sort_values(ascending=False)

Country
USA            886.375000
Australia      712.538922
New Zealand    702.702703
Canada         696.480000
UK             578.196429
Name: Importance_reducing_pollution, dtype: float64

Judging by the average importance score for reducing pollution, students in the US are the most concerned about this issue, and students in the UK are the least concerned.