# Lab - Week 1

*Assigned: Thurs. 9/3*  
**DUE: Tues. 9/8 @ 5:00pm**

**Name: `Hunter Chambers`**

Assignment Objectives:  
Upon successful completion of this assignment, a student will be able to: 

* Add new text and code cells to a Colab notebook
* Gain experience in formatting text using Markdown 
* Load in a data set and explore its properties (missing data, statistics, basic visualizations). 


In [None]:
#  Import libraries 
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

*Tip: It is good practice to list all imports at the top of the notebook. You can import modules in later cells as needed, but listing them at the top shows clearly which modules must be available / installed.*

## Example 1 - More Data Cleaning 
*Adapted from J. Sullivan*

Let's look at another data file to see additional data cleaning steps and code.  

The initial data set reads in part: 

![property data](https://pages.mtu.edu/~lebrown/un5550-f20/week1/property-data.jpg)

In [None]:
# Cell only needs to be run once to load the CSV to the local storage.
from google.colab import files

uploaded = files.upload()

Saving property.csv to property.csv


In [None]:
prop = pd.read_csv("property.csv")
prop

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3,1,1000
1,100002000.0,197.0,LEXINGTON,N,3,1.5,--
2,100003000.0,,LEXINGTON,N,,1,850
3,100004000.0,201.0,BERKELEY,12,1,,700
4,,203.0,BERKELEY,Y,3,2,1600
5,100006000.0,207.0,BERKELEY,Y,,1,800
6,100007000.0,,WASHINGTON,,2,HURLEY,950
7,100008000.0,213.0,TREMONT,Y,--,1,
8,100009000.0,215.0,TREMONT,Y,na,2,1800


We can see that `pandas` already recognizes some of the different ways that missing values are encoded in the data.

For instance, in the `ST_NUM` column, the 3rd entry is blank and the 7th entry is NaN. `pandas` filled in the blank entry with `NaN` as well, and both of these values are found by the `isnull()` method.

In [None]:
prop['ST_NUM'].isnull()

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False
Name: ST_NUM, dtype: bool
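
As a minimal sketch of the same idea, the snippet below counts missing values instead of listing them row by row. The three-row CSV text is a hypothetical stand-in for `property.csv`, not the real file.

```python
import io
import pandas as pd

# Hypothetical sample: one blank field and one "NA" field in ST_NUM.
csv_text = "PID,ST_NUM\n1,104\n2,\n3,NA\n"
sample = pd.read_csv(io.StringIO(csv_text))

missing_mask = sample["ST_NUM"].isnull()     # boolean Series, True where a value is missing
missing_per_column = sample.isnull().sum()   # count of missing values in each column

print(missing_mask.tolist())
print(missing_per_column)
```

Summing the boolean mask works because `True` counts as 1 and `False` as 0.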

However, there are other missing value encodings that pandas does not immediately recognize. 

Let's look at the `NUM_BEDROOMS` column.

![property data 2](https://pages.mtu.edu/~lebrown/un5550-f20/week1/property-data2.jpg)



In this column, we have missing values as "n/a", "NA", "--" and "na".

Let's see what `pandas` automatically recognizes.

In [None]:
prop['NUM_BEDROOMS'].isnull()

0    False
1    False
2     True
3    False
4    False
5     True
6    False
7    False
8    False
Name: NUM_BEDROOMS, dtype: bool

`pandas` automatically recognizes the "n/a" and "NA" but not the "--" and "na". 

Let's change that! 

In [None]:
# Making a list of missing value types
missing_values = ["n/a", "na", "--", "NA"]
prop2 = pd.read_csv("property.csv", na_values = missing_values)
prop2

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3.0,1,1000.0
1,100002000.0,197.0,LEXINGTON,N,3.0,1.5,
2,100003000.0,,LEXINGTON,N,,1,850.0
3,100004000.0,201.0,BERKELEY,12,1.0,,700.0
4,,203.0,BERKELEY,Y,3.0,2,1600.0
5,100006000.0,207.0,BERKELEY,Y,,1,800.0
6,100007000.0,,WASHINGTON,,2.0,HURLEY,950.0
7,100008000.0,213.0,TREMONT,Y,,1,
8,100009000.0,215.0,TREMONT,Y,,2,1800.0


In [None]:
print(prop2['NUM_BEDROOMS'])
print(prop2['NUM_BEDROOMS'].isnull())
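
The effect of `na_values` can be checked on a small inline sample. This is a sketch under assumptions: the CSV text below is hypothetical and only mimics two columns of the property data. The last step, `pd.to_numeric(..., errors="coerce")`, is one possible way to handle a stray entry such as "HURLEY" that is not a missing-value marker; it is not part of the original lab.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the NUM_BEDROOMS and NUM_BATH columns.
csv_text = "NUM_BEDROOMS,NUM_BATH\n3,1\n--,1.5\nna,HURLEY\nn/a,2\n"

default_read = pd.read_csv(io.StringIO(csv_text))
custom_read = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])

# By default only "n/a" is treated as missing; the custom list adds "--" and "na".
default_missing = int(default_read["NUM_BEDROOMS"].isnull().sum())
custom_missing = int(custom_read["NUM_BEDROOMS"].isnull().sum())
print(default_missing, custom_missing)

# Non-numeric text in a numeric column becomes NaN when coerced.
bath_numeric = pd.to_numeric(custom_read["NUM_BATH"], errors="coerce")
print(bath_numeric.isnull().tolist())
```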

# Exercises 

For this portion of the lab you are going to explore and learn about different basic functions in Python and use concepts covered in class.


## Exercise 1 - Printing

In many courses and tutorials for new languages, the first thing you learn is printing "Hello World".

In [None]:
print('Hello World')

Hello World


We can also capture `input` from the user. 
https://docs.python.org/3/library/functions.html#input

In [None]:
firstName = input('what is your name?')

what is your name?Hunter


In [None]:
"Hello " + firstName + "!"

'Hello Hunter!'
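
String concatenation is one way to build the greeting; a formatted string literal (f-string) is a common alternative. In this small sketch the name is hard-coded instead of read with `input()` so the example runs without user interaction, and `first_name` is an illustrative variable name.

```python
# Hard-coded stand-in for the value returned by input()
first_name = "Hunter"

greeting_concat = "Hello " + first_name + "!"   # concatenation, as in the cell above
greeting_fstring = f"Hello {first_name}!"       # f-string, available since Python 3.6

print(greeting_fstring)
print(greeting_concat == greeting_fstring)
```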

Apply the built-in function `dir()` to the variable `firstName` above and print the outcome.

https://docs.python.org/3/library/functions.html#dir

In [None]:
dir(firstName)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

This lists all the attributes and methods available on the string `firstName`.

I want you to explore the built-in function `len()` and the string methods `split()` and `strip()` on the following string.
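
The names surrounded by double underscores are special methods that Python calls internally. One possible way to see only the everyday string methods is to filter them out, as in this short sketch (the variable names are illustrative):

```python
first_name = "Hunter"

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(first_name) if not name.startswith("_")]

print(public_names[:5])
print("upper" in public_names)
```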

https://docs.python.org/3/library/functions.html

In [None]:
className = "Introduction to Data Science   "

In [None]:
# Show how to find the length of the string - className
len(className)

31

In [None]:
# Show the results of the `split()` function on the string "className"
className.split()

['Introduction', 'to', 'Data', 'Science']

In [None]:
# Save the results of the `strip()` function on the string "className" into a 
#  new variable "className2"
className2 = className.strip()
className2

'Introduction to Data Science'
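
A short sketch of how these three operations interact on the same string. It also shows that `split(" ")` with an explicit separator behaves differently from `split()` with no argument when spaces repeat.

```python
class_name = "Introduction to Data Science   "

print(len(class_name))           # 31: the three trailing spaces are counted
print(len(class_name.strip()))   # 28: strip() removes leading and trailing whitespace
print(class_name.split())        # no argument: runs of whitespace act as one separator
print(class_name.split(" "))     # explicit separator: trailing spaces yield empty strings
```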

## Example 2 - Comments 

To create a comment, use the `#` (hash) symbol followed by a space; it can sit on its own line or after code on the same line. (Shortcut: Ctrl+/. To uncomment, remove the `#` or press Ctrl+/ again.)

Another option is to enclose the text in triple quotes (`"""` or `'''`). Strictly speaking, this creates a string literal that Python evaluates and discards rather than a true comment, and it needs to be on its own line, separate from the code. Different programming languages have different approaches to commenting, so please be aware.

In [None]:
# This is a comment

In [None]:
'''This is a larger comment block 
that may span multiple lines 
'''
# 2 + 2
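
The difference between a `#` comment and a triple-quoted string can be seen inside a function, where a leading string becomes the function's docstring. The function `add` below is a hypothetical example, not part of the lab.

```python
def add(a, b):
    """Return the sum of a and b."""
    # This is a comment: the interpreter ignores it entirely.
    return a + b

print(add.__doc__)   # the triple-quoted string is stored on the function
print(add(2, 2))
```

The `#` comment leaves no trace at run time, while the docstring stays available through `add.__doc__` and `help(add)`.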

## Exercise 2 - Markdown

The Markdown option for cells in a Jupyter notebook provides a way to display information to the user around particular code snippets. For more information and reading, please look into:

https://help.github.com/articles/basic-writing-and-formatting-syntax/

https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html

Colab's Markdown Guide: https://colab.research.google.com/notebooks/markdown_guide.ipynb#scrollTo=5Y3CStVkLxqt

For this exercise add a new 'Text' cell and try to recreate the following block of text. 

![example markdown](https://pages.mtu.edu/~lebrown/un5550-f20/week1/markdown-example.png)



We can start with a few different paragraphs of text. This first paragraph will have a few sentences with various markups found. Things like **bold**, *italics*, ~~strikethrough~~, and even `monospace`.

Here is another paragraph of text that contains a url [https://mtu.edu](https://mtu.edu).

We can have lists:


*   one
*   two
*   three

And more lists:



1.   one
2.   two
3.   three

Nested Lists:

* one
  * one A
  * one B
* two
* three









## Example 3 - String Operations 

Here you can see some more operations working with strings.

In [None]:
text = "Hello Data Science 2019"  # named `text` so the built-in `str` type is not shadowed

In [None]:
print(text.find("2019"))

19


In [None]:
print(text[-4:])

2019


In [None]:
text.upper()

'HELLO DATA SCIENCE 2019'

In [None]:
text.lower()

'hello data science 2019'

In [None]:
text + ' & ' + 'FutureDatascientist'

'Hello Data Science 2019 & FutureDatascientist'
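
A few related operations, sketched on the same sentence. The variable name `text` is chosen here for illustration; note that `find()` returns -1 when the substring is absent instead of raising an error.

```python
text = "Hello Data Science 2019"

print(text.find("2019"))              # 19: index where the substring starts
print(text.find("2020"))              # -1: substring not found
print("Data" in text)                 # True: membership test
print(text[6:10])                     # Data: slice from index 6 up to (not including) 10
print(text.replace("2019", "2020"))   # returns a new string; text itself is unchanged
```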

## Exercise 3 - Pandas 

Pandas Resources:
* https://pandas.pydata.org/
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

We are going to be using the Auto+MPG data set, which is part of the UCI Machine Learning Repository, a common place to find data sets for testing code and for learning about machine learning and data science.

1. Go to the repository information on this data set. https://archive.ics.uci.edu/ml/datasets/Auto+MPG
2. Download the data set and information files to your local machine. Note, the data is not in a `.csv` or `.tab` or `.txt` format.
3. Upload the file to be used in the Colab notebook. 
4. In the next cell, you will modify the code to read in the `auto-mpg.data` file properly.  Use the following names for the columns:  
mpg, cyl, disp, hp, wgt, acc, year, origin, name



In [None]:
# Cell only needs to be run once to load the CSV to the local storage.
from google.colab import files

uploaded = files.upload()

Saving auto-mpg.data to auto-mpg.data


In [None]:
# Read the whitespace-delimited file, using the column names provided above.
# The raw string r'\s+' avoids an invalid-escape warning for the regular expression.
df = pd.read_csv('auto-mpg.data', sep=r'\s+',
                 names=['mpg', 'cyl', 'disp', 'hp', 'wgt', 'acc', 'year', 'origin', 'name'])
df.head()

Unnamed: 0,mpg,cyl,disp,hp,wgt,acc,year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


*Tip: it may be helpful to look at the documentation on `read_csv`:*  
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
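
A minimal sketch of reading whitespace-delimited text with explicit column names. The two rows below are an illustrative excerpt in the layout of `auto-mpg.data`, and passing `na_values="?"` is an assumption about how missing horsepower values are marked; it is an optional extra, not something the exercise requires.

```python
import io
import pandas as pd

# Illustrative excerpt; "?" stands for a missing horsepower value.
raw = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
)

columns = ["mpg", "cyl", "disp", "hp", "wgt", "acc", "year", "origin", "name"]
cars = pd.read_csv(io.StringIO(raw), sep=r"\s+", names=columns, na_values="?")

print(cars.shape)
print(cars["hp"])
```

With the "?" marker declared, `hp` is read as a numeric column with one `NaN` instead of a column of text.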

## Exercise 4 - Pandas 

Here you will explore properties of the DataFrame and its attributes.

In [None]:
# Determine the number of rows and columns of the data set 
print(len(df))     # number of rows (count() would skip missing values)
len(df.columns)    # number of columns

398


9

What are the names of the columns? 

In [None]:
for col in df.columns:
  print(col)

mpg
cyl
disp
hp
wgt
acc
year
origin
name
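
Both dimensions are also available at once through the `shape` attribute. The sketch below uses a small hypothetical DataFrame in place of the Auto MPG data.

```python
import pandas as pd

# Hypothetical three-row frame for illustration.
demo = pd.DataFrame({"mpg": [18.0, 15.0, 18.0], "cyl": [8, 8, 8]})

n_rows, n_cols = demo.shape      # (rows, columns) as a tuple
column_names = list(demo.columns)

print(n_rows, n_cols)
print(column_names)
```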


## Exercise 5 - Pandas 

Show the first 3 rows of the DataFrame. 


In [None]:
df.head(3)

Unnamed: 0,mpg,cyl,disp,hp,wgt,acc,year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite


Show the last 8 rows of the DataFrame.

In [None]:
df.tail(8)

Unnamed: 0,mpg,cyl,disp,hp,wgt,acc,year,origin,name
390,32.0,4,144.0,96.0,2665.0,13.9,82,3,toyota celica gt
391,36.0,4,135.0,84.0,2370.0,13.0,82,1,dodge charger 2.2
392,27.0,4,151.0,90.0,2950.0,17.3,82,1,chevrolet camaro
393,27.0,4,140.0,86.0,2790.0,15.6,82,1,ford mustang gl
394,44.0,4,97.0,52.0,2130.0,24.6,82,2,vw pickup
395,32.0,4,135.0,84.0,2295.0,11.6,82,1,dodge rampage
396,28.0,4,120.0,79.0,2625.0,18.6,82,1,ford ranger
397,31.0,4,119.0,82.0,2720.0,19.4,82,1,chevy s-10


## Exercise 6 - Pandas 

Practice selecting different parts of the DataFrame

Select the horsepower column

In [None]:
# select just the horsepower column 
df['hp']

0      130.0
1      165.0
2      150.0
3      150.0
4      140.0
       ...  
393    86.00
394    52.00
395    84.00
396    79.00
397    82.00
Name: hp, Length: 398, dtype: object
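
The `dtype: object` in the output above indicates that `hp` was read as text rather than numbers, which is why it is absent from numeric summaries. A possible remedy, sketched here on hypothetical values and assuming the cause is a non-numeric marker such as "?", is to coerce the column with `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical values; "?" stands in for a missing horsepower entry.
hp_raw = pd.Series(["130.0", "165.0", "?", "150.0"], name="hp")

# errors="coerce" converts anything that is not a number into NaN.
hp_numeric = pd.to_numeric(hp_raw, errors="coerce")

print(hp_numeric)
print(hp_numeric.mean())   # NaN values are skipped by default
```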

Select both the cylinders and displacement columns.


In [None]:
df[['cyl', 'disp']]

Unnamed: 0,cyl,disp
0,8,307.0
1,8,350.0
2,8,318.0
3,8,304.0
4,8,302.0
...,...,...
393,4,140.0
394,4,97.0
395,4,135.0
396,4,120.0


## Exercise 7 - Pandas 

Select the 8th row of the DataFrame 


In [None]:
df.iloc[7]  # position-based: the 8th row (same as df.loc[7] with the default index)

mpg                      14
cyl                       8
disp                    440
hp                    215.0
wgt                    4312
acc                     8.5
year                     70
origin                    1
name      plymouth fury iii
Name: 7, dtype: object

Select the 4th and 5th row 

In [None]:
df.iloc[[3, 4]]  # position-based: the 4th and 5th rows

Unnamed: 0,mpg,cyl,disp,hp,wgt,acc,year,origin,name
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


Select every other row and column starting from row 3 and column 4

In [None]:
df.iloc[2::2, 3::2]

Unnamed: 0,hp,acc,origin
2,150.0,11.0,1
4,140.0,10.5,1
6,220.0,9.0,1
8,225.0,10.0,1
10,170.0,10.0,1
...,...,...,...
388,92.00,14.5,1
390,96.00,13.9,3
392,90.00,17.3,1
394,52.00,24.6,2
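
The selections in this exercise use `loc` (label-based) and `iloc` (position-based). With the default 0, 1, 2, ... index they return the same rows, but they differ once the index labels are something else. A small sketch with a hypothetical DataFrame:

```python
import pandas as pd

demo = pd.DataFrame(
    {"mpg": [18.0, 15.0, 18.0, 16.0], "cyl": [8, 8, 8, 8]},
    index=[10, 20, 30, 40],   # deliberately not 0..3
)

by_label = demo.loc[20]          # the row whose index label is 20
by_position = demo.iloc[1]       # the second row, whatever its label
every_other = demo.iloc[::2, :]  # rows at positions 0 and 2, all columns

print(by_label.equals(by_position))
print(list(every_other.index))
```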


## Exercise 8 - Data Selection and Statistics 

Compute `mean()` and `min()` for the first 10 data points.

*Hint: remember df.head(10) returns the first 10 rows of the DataFrame*

In [None]:
df.head(10).mean(numeric_only=True)  # skip text columns; required in recent pandas versions

mpg         15.6
cyl          8.0
disp       374.9
wgt       3879.7
acc         10.3
year        70.0
origin       1.0
dtype: float64

In [None]:
df.head(10).min()

mpg                       14
cyl                        8
disp                     302
hp                     130.0
wgt                     3433
acc                      8.5
year                      70
origin                     1
name      amc ambassador dpl
dtype: object
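
Note that `mean()` only makes sense for numeric columns, while `min()` also works on text, where it returns the alphabetically first value. A sketch on a hypothetical three-row frame, using `numeric_only=True` to make the column selection explicit:

```python
import pandas as pd

demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0],
    "wgt": [3504.0, 3693.0, 3436.0],
    "name": ["chevrolet chevelle malibu", "buick skylark 320", "plymouth satellite"],
})

means = demo.head(2).mean(numeric_only=True)   # text column "name" is left out
minimums = demo.head(2).min()                  # includes "name", compared as strings

print(means)
print(minimums)
```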

## Exercise 9 - Data Selection and Statistics 

Group by column "origin" and find the median for the other variables. 

In [None]:
df.groupby('origin').median(numeric_only=True)  # skip text columns; required in recent pandas versions

origin,mpg,cyl,disp,wgt,acc,year
1,18.5,6,250.0,3365.0,15.0,76
2,26.5,4,104.5,2240.0,15.7,76
3,31.6,4,97.0,2155.0,16.4,78
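
`groupby` splits the rows by the value of one column, applies a function to each group, and combines the results into one row per group. A minimal sketch on hypothetical data:

```python
import pandas as pd

demo = pd.DataFrame({
    "origin": [1, 1, 1, 2, 2],
    "mpg": [18.0, 15.0, 16.0, 26.0, 24.0],
    "name": ["a", "b", "c", "d", "e"],
})

# One row per origin; the text column is excluded from the median.
medians = demo.groupby("origin").median(numeric_only=True)
print(medians)
```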


## Exercise 10 - Data Selection and Statistics 

Find the mean performance (across all variables) of vehicles after 1975. 

In [None]:
df[df['year'] > 75].mean(numeric_only=True)  # model years are stored as two digits, so 75 means 1975

mpg         26.952778
cyl          5.050926
disp       165.541667
wgt       2778.337963
acc         16.118519
year        78.935185
origin       1.671296
dtype: float64
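
The selection above relies on a boolean mask: the comparison produces one `True`/`False` per row, and indexing with that mask keeps only the `True` rows. A short sketch on hypothetical data:

```python
import pandas as pd

demo = pd.DataFrame({"year": [70, 75, 76, 82], "mpg": [18.0, 23.0, 29.0, 31.0]})

mask = demo["year"] > 75        # boolean Series: False, False, True, True
after_1975 = demo[mask]         # keeps only the rows where the mask is True
mean_mpg = after_1975["mpg"].mean()

print(mask.tolist())
print(mean_mpg)
```

Because the comparison is strict (`>`), vehicles from model year 75 itself are excluded.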