# Module 1 - Basic Python and Pandas (30 minutes)

* The course is hands-on, so please work along with us ...
* Each module has an exercise at the end to reinforce learning
* Please feel free to ask questions at any time!

A good understanding of **Pandas** basics is essential for this course because Pandas works so well with MongoDB.

In [None]:
# upload to Jupyter and get data into Pandas

import pandas as pd

f = 'data/cars.csv'
df = pd.read_csv(f)

# View a sample from the DataFrame

In [None]:
# view a sample of records from a Pandas DataFrame

df.head()

# Get the shape of the data

In [None]:
# get shape of DataFrame

df.shape  # 406 records with 9 column features

# Limit the size of the sample viewed

In [None]:
# view just one record (just put in the number you want to view)

n = 1
df.head(n)

# View samples from the end of the DataFrame

In [None]:
# view last 5 records

df.tail()

In [None]:
# view last n records

n = 3

df.tail(n)

# View a sample restricted by fields

In [None]:
# use a list to hold column feature names

n = 3
ls = ['Car', 'Model', 'Origin', 'MPG']
df[ls].head(n)

In [None]:
# alternatively ...

df[['Car', 'Model', 'Origin', 'MPG']].head(3)

# Get the features (fields)

In [None]:
# get feature columns

features = list(df)
features

# Get record counts

In [None]:
# count of all records

df.count()

In [None]:
# count cars from each country by cylinder

cyl_by_origin = df.groupby(['Origin', 'Cylinders'])['Car'].count()
cyl_by_origin

This may seem a bit confusing because you can put almost any field name from the DataFrame in the brackets '[]'. Once we group by 'Origin' and then by 'Cylinders', 'count()' counts the non-missing values of the selected field in each group, so any field without missing values gives the same counts. Pandas simply needs a field to count. The result is the number of cars for each combination of origin and cylinder count. For example, the data has 72 US cars with 4 cylinders, 74 with 6 cylinders, and 108 with 8 cylinders.
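To see why the field in the brackets matters so little, here is a minimal sketch on a made-up four-row DataFrame (the rows are invented for illustration and are not taken from cars.csv). It counts two different fields and also uses `size()`, which counts rows per group without naming a field at all.

```python
import pandas as pd

# toy data, invented for illustration (not taken from cars.csv)
toy = pd.DataFrame({
    'Car': ['A', 'B', 'C', 'D'],
    'Origin': ['US', 'US', 'US', 'Japan'],
    'Cylinders': [8, 8, 4, 4],
    'MPG': [15.0, 14.0, 28.0, 32.0],
})

# count non-missing 'Car' values in each (Origin, Cylinders) group
by_car = toy.groupby(['Origin', 'Cylinders'])['Car'].count()

# counting a different field gives the same numbers when nothing is missing
by_mpg = toy.groupby(['Origin', 'Cylinders'])['MPG'].count()

# size() counts rows per group and needs no field at all
by_size = toy.groupby(['Origin', 'Cylinders']).size()

print(by_car)
```

The counts only differ when the chosen field has missing values, because `count()` skips them while `size()` does not.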

# Get unique values from a field

In [None]:
# view unique cylinder types

cylinders = df['Cylinders'].unique()
cylinders

# Run some simple analytics on the data

In [None]:
# find the average weight of all cars (2 decimal places)

import numpy as np

weight = df['Weight']
avg_weight = np.round(weight.mean(axis = 0), 2)  # axis 0 means to view by column
avg_weight

In [None]:
# find the average MPG for all cars

mpg = df['MPG']
avg_mpg = np.round(mpg.mean(axis = 0), 2)
avg_mpg

In [None]:
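# heck, we can get the mean of all numeric features!
# (numeric_only=True skips text columns, which recent Pandas versions refuse to average)

avg = np.round(df.mean(numeric_only=True), 2)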
avg

In [None]:
# for stats crazies, we can get the standard deviation!

mpg = df['MPG']
sd_mpg = np.round(mpg.std(), 2)
sd_mpg

# Slice and Dice

1. Slice rows and dice columns.
2. Pandas is great for slicing and dicing data.
3. The 'iloc' indexer offers an easy way to slice and dice by integer position.
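Before slicing the real data, a small sketch may help show how 'iloc' treats positions. The five rows below are invented for illustration and are not taken from cars.csv. Note that a slice such as `1:3` stops before position 3, just like a Python list slice.

```python
import pandas as pd

# toy data, invented for illustration (not taken from cars.csv)
toy = pd.DataFrame({
    'Car': ['A', 'B', 'C', 'D', 'E'],
    'MPG': [15.0, 14.0, 28.0, 32.0, 20.0],
    'Cylinders': [8, 8, 4, 4, 6],
})

# rows at positions 1 and 2 (the stop position 3 is excluded)
rows = toy.iloc[1:3]

# all rows, columns at positions 1 and 2 ('MPG' and 'Cylinders')
cols = toy.iloc[:, 1:3]

# explicit lists pick rows and columns in exactly the order given
picked = toy.iloc[[0, 4], [2, 0]]

print(picked)
```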

In [None]:
# slice rows 5, 6, 7, 8, and 9

row_slice = df.iloc[5:10]
row_slice

In [None]:
# slice rows 0, 1, and 2 and dice MPG and Cylinders features for those rows

col_slice = df.iloc[:, 1:3].head(3)
col_slice

In [None]:
# slice rows 0, 5, and 8
# slice cols 1, 0, and 5

slice1 = df.iloc[[0, 5, 8], [1, 0, 5]]
display(slice1)

# slice rows 5, 6, and 7
# slice cols 1, 0, 5, and 7

slice2 = df.iloc[5:8, [1, 0, 5, 7]]
display(slice2)

# Let's just analyze US cars

In [None]:
# build DataFrame for US cars only

us_filter =  df['Origin'] == 'US'
us = df[us_filter]
us.head()

In [None]:
# get shape of new US DataFrame

us.shape  # 254 records with 9 feature columns

In [None]:
# get averages for US cars

avg_us = np.round(us.mean(numeric_only=True), 2)  # numeric_only=True skips text columns
avg_us

In [None]:
# get average MPG for US cars

avg_us_mpg = np.round(us['MPG'].mean(), 2)
avg_us_mpg

# Let's analyze European cars

In [None]:
# get European cars with 5 cylinders (a bit more complex but useful)

# first, keep only the rows (records) whose 'Origin' is 'Europe'

n = 3
europe_filter = df['Origin'] == 'Europe'
europe = df[europe_filter]
europe.head(n)

In [None]:
# only 73 cars are European

europe.shape

In [None]:
# second, keep only the European rows (records) with 5 cylinders

filter_cyl = europe['Cylinders'] == 5
cyl5 = europe[filter_cyl]
cyl5.head()  # why do we only get 3 records?

In [None]:
# only 3 European cars have 5 cylinders!

cyl5.shape

In [None]:
# verify that it worked from original DataFrame

df[['Car', 'MPG', 'Cylinders']].iloc[281]

In [None]:
# the same two filters written more compactly with eq()

europe = df[df.Origin.eq('Europe')]
cyl5 = europe[europe.Cylinders.eq(5)]
cyl5.head()
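The two filters can also be combined into a single boolean mask with `&`. This is a minimal sketch on invented rows (not taken from cars.csv); the name `europe_cyl5` is just an illustrative choice. Each comparison needs its own parentheses because `&` binds more tightly than `==`.

```python
import pandas as pd

# toy data, invented for illustration (not taken from cars.csv)
toy = pd.DataFrame({
    'Car': ['A', 'B', 'C', 'D'],
    'Origin': ['Europe', 'Europe', 'US', 'Japan'],
    'Cylinders': [5, 4, 8, 4],
})

# each comparison yields a boolean Series; & keeps rows where both are True
mask = (toy['Origin'] == 'Europe') & (toy['Cylinders'] == 5)
europe_cyl5 = toy[mask]

print(europe_cyl5)
```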

In [None]:
# select rows whose column value is in a list

num = [3, 5]
cyl = df[df.Cylinders.isin(num)]
cyl  # print ALL results!

In [None]:
# verify shape

cyl.shape

# Module 1 Exercise

* Compare the 'MPG' for cars across all unique origins.
* Does this tell us anything of interest?
* Can we further analyze the data by digging a bit deeper?

# Our Solution

We suggest that you break the problem down into simple pieces.

1. Identify all possible origins.
2. Create DataFrames for each origin.
3. Find average 'MPG' for each origin and compare
4. Break down further to see cylinders and counts by origin

In [None]:
# Step 1 - get unique origins

origins = df['Origin'].unique()
origins

In [None]:
# Step 2 - Create DataFrames for each origin

us = df[df.Origin.eq('US')]
europe = df[df.Origin.eq('Europe')]
japan = df[df.Origin.eq('Japan')]

In [None]:
# Always verify your results!

display (us.tail(n))
display (europe.tail(n))
display (japan.tail(n))

In [None]:
# Step 3 - Find average 'MPG'

avg_us = np.round(us['MPG'].mean(), 2)
avg_europe = np.round(europe['MPG'].mean(), 2)
avg_japan = np.round(japan['MPG'].mean(), 2)


print('US:', avg_us)
print('Europe:', avg_europe)
print('Japan:', avg_japan)

It is easy to see that US cars get the worst 'MPG' and Japanese cars get the best. But can we identify a reason?
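The per-origin averages can also be computed in one step with 'groupby', without building a DataFrame for each origin. The sketch below uses invented rows (not taken from cars.csv), so the numbers only illustrate the technique.

```python
import pandas as pd

# toy data, invented for illustration (not taken from cars.csv)
toy = pd.DataFrame({
    'Origin': ['US', 'US', 'Europe', 'Europe', 'Japan', 'Japan'],
    'MPG': [15.0, 20.0, 25.0, 27.0, 30.0, 32.0],
})

# average MPG per origin, rounded and sorted from worst to best
avg_by_origin = toy.groupby('Origin')['MPG'].mean().round(2).sort_values()

print(avg_by_origin)
```

On the real data, replacing `toy` with `df` should give the same three averages as Step 3.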

In [None]:
# Step 4, part 1 - Find cylinders by origin

us_cyl = us['Cylinders'].unique()
europe_cyl = europe['Cylinders'].unique()
japan_cyl = japan['Cylinders'].unique()

print('US:', us_cyl)
print('Europe:', europe_cyl)
print('Japan:', japan_cyl)

The reason might be that US cars use more gas because many of them have 8 cylinders. But, we still might want to dig a bit deeper. Let's look at the record counts and cylinders by origin.

In [None]:
# Step 4, part 2 - Find record count by origin

print ('US:', us.shape)
print ('Europe:', europe.shape)
print ('Japan:', japan.shape)

We do have an issue that is very common in data analytics: the data is **unbalanced**. That is, record counts vary quite a bit between US cars and the other two origins, so our results may be inconclusive. Of course, this data is not that badly out of balance. Another issue is that we don't have much data, and decisions are better informed with more data. What can we do? We can get more data! That is why companies are always trying to get as much data as they can.
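One quick way to gauge balance is to look at each origin's share of the records. This is a minimal sketch on invented rows (not taken from cars.csv), using `value_counts` with `normalize=True` to turn counts into proportions.

```python
import pandas as pd

# toy data, invented for illustration (not taken from cars.csv)
toy = pd.DataFrame({
    'Origin': ['US'] * 5 + ['Japan'] * 3 + ['Europe'] * 2,
})

# raw record count per origin
counts = toy['Origin'].value_counts()

# share of all records per origin (proportions sum to 1)
share = toy['Origin'].value_counts(normalize=True)

print(share)
```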

In [None]:
# record count by cylinder

cyl_by_us = us.groupby(['Origin', 'Cylinders'])['Car'].count()
print (cyl_by_us)

cyl_by_europe = europe.groupby(['Origin', 'Cylinders'])['Car'].count()
print (cyl_by_europe)

cyl_by_japan = japan.groupby(['Origin', 'Cylinders'])['Car'].count()
print (cyl_by_japan)

From this small dataset, it appears that US cars are less efficient than European or Japanese cars. A likely reason is that 8 cylinder cars generally get really bad 'MPG', and the largest group of US cars in the data is of the 8 cylinder variety (108 of 254).
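One way to check this reasoning is to compare 'MPG' within each cylinder class, so that origins are compared on like-for-like cars. The sketch below uses invented rows (not taken from cars.csv), so its numbers say nothing about the real data.

```python
import pandas as pd

# toy data, invented for illustration (not taken from cars.csv)
toy = pd.DataFrame({
    'Origin': ['US', 'US', 'US', 'Japan', 'Japan', 'Europe'],
    'Cylinders': [8, 8, 4, 4, 4, 4],
    'MPG': [14.0, 16.0, 28.0, 32.0, 30.0, 29.0],
})

# average MPG for each (Cylinders, Origin) pair, reshaped so that
# rows are cylinder counts and columns are origins
table = toy.groupby(['Cylinders', 'Origin'])['MPG'].mean().unstack()

print(table)
```

Empty cells (NaN) mark combinations with no cars, such as 8 cylinder Japanese cars in this toy data.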

# Whew! We did a lot in a very short amount of time

1. we read a CSV file into a Pandas DataFrame
2. we manipulated the DataFrame in many ways
3. we worked through an exercise to sharpen our skills

## Questions?

# <font color=red>5 minute break</font>