# Lambda School Data Science - Unit 1 Sprint 1 Module 1

## Exploratory Data Analysis

### Module Learning Objectives


- Load a CSV dataset from a URL using `pandas.read_csv`
- Load a CSV dataset from a local file using `pandas.read_csv`
- Use basic pandas functions for Exploratory Data Analysis (EDA)
- Describe and discriminate between basic data types such as categorical, quantitative, continuous, discrete, ordinal, nominal and identifier

### Notebook points: 12

## Autograded Module Projects

Welcome to the first Module Project of Unit 1! You will complete a project (sometimes also referred to as an assignment) after the Guided Project for each Module. There will be four Module Projects per Sprint and each project is designed to provide you with an opportunity to practice what you learned in the Canvas Warm-up material and the Guided Project with your Instructors.

Throughout Unit 1, the Module Projects and the Sprint Challenges are *autograded*. You will complete your work in a Jupyter/Python notebook (the files that end with `.ipynb`) and then upload your completed notebook to Canvas and submit for grading. This autograding process will check your answers and provide more information about the errors in your notebook. You'll receive a score when the testing is complete.

So, let's get started! If you are reading this notebook, then you have the correct autograded version.

## Introduction

For this module, we learned how to use some of the most common tools for exploring our data. We'll continue to practice our `pandas` skills with a new dataset. 

## Dataset Description

Explore the University of California - Irvine Adult Dataset

**Task 1** - Load a dataset via its URL

* Create a Python list named `column_headers` with the following items: `'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'`. Lists have the form `listname = ['element 1', 'element 2', 'element 3']`
* Load the file from the URL provided below, passing the `column_headers` list you created to `pd.read_csv()` through the `names=` parameter so that it names your columns.
* Name the DataFrame `adult`
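
As a minimal sketch of how `names=` behaves, the example below reads a tiny, made-up headerless CSV from an in-memory string instead of the real URL. The variable names `raw_text`, `toy_headers`, and `toy` are hypothetical and are not part of the assignment.

```python
import io

import pandas as pd

# A tiny, made-up headerless CSV standing in for the real file
raw_text = "39,State-gov,13\n50,Private,9\n38,Private,7\n"

# Hypothetical column names for this toy example
toy_headers = ['age', 'workclass', 'education-num']

# names= supplies the column labels, so the first line is read as data
toy = pd.read_csv(io.StringIO(raw_text), names=toy_headers)

print(toy.shape)
print(list(toy.columns))
```

Without `names=`, `read_csv` would treat the first data row as the header, and the DataFrame would come back one row short.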

In [27]:
# Task 1
import pandas as pd
import numpy as np

# URL for the dataset
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

### BEGIN SOLUTION

# Create the column headers list
column_headers = ['age',
                  'workclass',
                  'fnlwgt', 
                  'education',
                  'education-num',
                  'marital-status',
                  'occupation',
                  'relationship',
                  'race',
                  'sex',
                  'capital-gain',
                  'capital-loss',
                  'hours-per-week',
                  'native-country',
                  'income']

adult = pd.read_csv(data_url, names=column_headers)

### END SOLUTION

# Print out your DataFrame
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**Task 1 Test**

In [28]:
# Task 1 - Test

# These tests are for you to check your work before submitting
assert isinstance(column_headers, list), 'Have you created a list of header names?'
assert len(column_headers) == 15, 'Did you include the correct number of items in the column_headers list?'
assert isinstance(adult, pd.DataFrame), 'Have you created a DataFrame named adult?'

# These tests will be completed when you submit your notebook for autograding
### BEGIN HIDDEN TESTS
assert adult.shape == (32561, 15), 'Double check your DataFrame size.'
### END HIDDEN TESTS

**Task 2** - Look at the first and last rows of the DataFrame

* Assign the first **ten (10)** rows of the `adult` DataFrame to `adult_head`
* Assign the last **ten (10)** rows of the `adult` DataFrame to `adult_tail`

In [29]:
# Task 2

### BEGIN SOLUTION
adult_head = adult.head(10)
adult_tail = adult.tail(10)
### END SOLUTION

# Optional: print out adult_head and adult_tail

**Task 2 Test**

In [30]:
# Task 2 - Test

# These tests are for you to check your work before submitting
assert isinstance(adult_head, pd.DataFrame), 'Have you created a DataFrame named adult_head?'
assert isinstance(adult_tail, pd.DataFrame), 'Have you created a DataFrame named adult_tail?'

# These tests will be completed when you submit your notebook for autograding
### BEGIN HIDDEN TESTS
assert adult_head.shape == (10, 15), 'Double check your adult_head DataFrame size.'
assert adult_tail.shape == (10, 15), 'Double check your adult_tail DataFrame size.'
### END HIDDEN TESTS

**Task 3** - Variable data types

For your `adult` DataFrame, determine the data types for the variables.

* Count the number of `int64` variable types and assign to `number_int64` (your value should be an integer)
* Count the number of `object` variable types and assign to `number_object` (your value should be an integer)

There are different ways to find the data types of the columns in a DataFrame. All you need to do for this task is to count the number of each data type and assign to the corresponding variable.
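
One possible way to count columns of a given type, sketched on a small made-up DataFrame (`toy` and `n_int` are hypothetical names). You can also simply read the counts from the printed `dtypes` output.

```python
import pandas as pd

# Made-up data: two integer columns and one text column
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'hours': [40, 13, 40],
    'workclass': ['State-gov', 'Private', 'Private'],
})

# One entry per column, showing each column's data type
print(toy.dtypes)

# Keep only the int64 columns and count them
n_int = len(toy.select_dtypes(include='int64').columns)
print(n_int)
```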

In [31]:
# Task 3

### BEGIN SOLUTION
# Display the data types
adult.dtypes
number_int64 = 6
number_object = 9
### END SOLUTION

**Task 3 Test**

In [32]:
# Task 3 - Test

# Hidden tests - you will see the results when you submit to Canvas
### BEGIN HIDDEN TESTS
assert number_int64 == 6, 'Double check the number of int64 variables.'
assert number_object == 9, 'Double check the number of object variables.'
### END HIDDEN TESTS

**Task 4** - DataFrame dimensions

* Find the dimensions of your DataFrame and assign result to the variable `adult_dimension`. Your variable should be a *tuple* and the row dimension should be listed first.

Hint: A tuple looks like this: (1, 2) - a tuple is a collection which is ordered and unchangeable. You can use the function `type(your_variable)` to print the variable type.

Hint 2: The `shape` attribute returns a tuple (no parentheses needed) - convenient!
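
A short sketch on a made-up DataFrame, showing that `shape` is accessed without parentheses and can be unpacked into rows and columns. The names `toy`, `dims`, `n_rows`, and `n_cols` are hypothetical.

```python
import pandas as pd

# Made-up data: three rows, two columns
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is an attribute, so there are no parentheses
dims = toy.shape

# Rows come first, then columns
n_rows, n_cols = dims
print(type(dims).__name__, n_rows, n_cols)
```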

In [33]:
# Task 4

### BEGIN SOLUTION
adult_dimension = adult.shape
### END SOLUTION

# Print out the shape
print(adult_dimension)

(32561, 15)


**Task 4 Test**

In [34]:
# Task 4 - Test

assert isinstance(adult_dimension, tuple), 'Have you created a tuple named adult_dimension?'

# Hidden tests - you will see the results when you submit to Canvas
### BEGIN HIDDEN TESTS
assert adult_dimension == (32561, 15), 'Check your dimensions and make sure you have rows listed first.'
### END HIDDEN TESTS

**Task 5** - Missing values

Are there any missing values in the dataset? Let's check!

* Check for missing values using `.isnull().sum()`
* Count the number of missing values and assign the value to the variable `adult_missing`. Your variable should be an integer.
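
The sketch below uses a made-up DataFrame that does contain missing values, to show what `.isnull().sum()` reports and one way to reduce it to a single integer. The names `toy`, `per_column`, and `total_missing` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
toy = pd.DataFrame({
    'age': [39, np.nan, 38],
    'workclass': ['State-gov', None, 'Private'],
})

# Missing values counted per column
per_column = toy.isnull().sum()
print(per_column)

# Summing the per-column counts gives one total for the whole DataFrame
total_missing = int(per_column.sum())
print(total_missing)
```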

In [35]:
# Task 5

### BEGIN SOLUTION
print(adult.isnull().sum())
# There are no missing values
adult_missing = 0
### END SOLUTION

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64


**Task 5 Test**

In [36]:
# Task 5 - Test

# Hidden tests - you will see the results when you submit to Canvas
### BEGIN HIDDEN TESTS
assert adult_missing == 0, 'Check your missing values sum - there should be no missing values.'
### END HIDDEN TESTS

**Task 6** - Viewing DataFrame statistics

Let's look at some of the *summary statistics* of our dataset. We can use the `.describe()` method to see the summary statistics of the numeric columns.

Look at the statistics for the `adult` DataFrame and then complete the following two tasks:

* Find the value for the mean `age` and assign it to the variable `mean_age` (your value should be a float rounded to two decimal places)
* Find the standard deviation (std) for the `hours-per-week` variable and assign it to `std_hpw` (your value should be a float rounded to two decimal places)

Are there any values shown that might be a code for missing data?  We'll learn how to change the values in a DataFrame in the next module.
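
Reading the numbers off the printed table is enough for this task. As an alternative sketch, the values can also be pulled out of the `describe()` result by row and column label. The data and the names `toy`, `stats`, `toy_mean_age`, and `toy_std_hours` are made up for illustration.

```python
import pandas as pd

# Made-up numeric data
toy = pd.DataFrame({'age': [20, 30, 40], 'hours': [10, 20, 60]})

# describe() returns a DataFrame indexed by statistic name
stats = toy.describe()
print(stats)

# Look up a statistic by row label and column label, then round
toy_mean_age = round(float(stats.loc['mean', 'age']), 2)
toy_std_hours = round(float(stats.loc['std', 'hours']), 2)
print(toy_mean_age, toy_std_hours)
```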

In [37]:
# Task 6

### BEGIN SOLUTION

# View the statistics
adult.describe()

# Answer:
# The values of 99999 for capital-gain
# and 99 for hours-per-week are code for missing values

mean_age = 38.58
std_hpw = 12.35

### END SOLUTION

**Task 6 Test**

In [38]:
# Task 6 - Test

# Hidden tests - you will see the results when you submit to Canvas
### BEGIN HIDDEN TESTS
assert round(mean_age) == 39, 'Check the results you read from your describe table'
assert round(std_hpw) == 12, 'Check the results you read from your describe table'
### END HIDDEN TESTS

**Task 7** - Summary statistics for non-numeric columns

We have some columns in this dataset that are non-numeric, or `object`, columns. These usually hold strings. Let's use the `describe()` method again, but include the argument `exclude='number'` to exclude the numeric columns.

Using the results of `describe(exclude='number')`, complete the following two tasks:

* Find the number of unique `education` values and assign it to `unique_edu` (your value should be an integer)
* Find the number of times the most frequent observation for `income` occurs and assign it to `freq_income` (your value should be an integer) *(Note: this is not the income value itself, just how many times it occurs)*
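
A small sketch of what the non-numeric summary contains, using made-up data. The `unique` row gives the number of distinct values and the `freq` row gives how often the most common value occurs. The names `toy`, `summary`, `toy_unique_edu`, and `toy_freq_income` are hypothetical.

```python
import pandas as pd

# Made-up data: two text columns and one numeric column
toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Bachelors', '11th'],
    'income': ['<=50K', '<=50K', '>50K', '<=50K'],
    'age': [39, 50, 38, 53],
})

# Summary rows for the non-numeric columns: count, unique, top, freq
summary = toy.describe(exclude='number')
print(summary)

# Number of distinct education values
toy_unique_edu = int(summary.loc['unique', 'education'])

# Number of times the most frequent income value occurs
toy_freq_income = int(summary.loc['freq', 'income'])
print(toy_unique_edu, toy_freq_income)
```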

In [39]:
# Task 7

### BEGIN SOLUTION

# View the non-numeric statistics
adult.describe(exclude='number')

unique_edu = 16
freq_income = 24720

### END SOLUTION

In [40]:
adult.describe(exclude='number')
adult['income'].value_counts()

 <=50K    24720
 >50K      7841
Name: income, dtype: int64

**Task 7 Test**

In [41]:
# Task 7 - Test

# Hidden tests - you will see the results when you submit to Canvas
### BEGIN HIDDEN TESTS
assert round(unique_edu) == 16, 'Check the results from your describe table'
assert round(freq_income) == 24720, 'Check the results from your describe table'
### END HIDDEN TESTS

**Task 8** - Finding value counts

Let's look more specifically at the `relationship` column and perform a count of the number of observations for each category. We can see how many categories we have by using the `.unique()` method on the column. Then, we can use `.value_counts()` to count the number of observations in each category.

* View the unique values in the `relationship` column with `.unique()`
* View the number of observations for each value with `.value_counts()`
* Find the counts for `Other-relative` and assign to the variable `adult_other_rel`
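
The two methods can be sketched on a made-up Series as follows; `rel`, `categories`, `counts`, and `n_husband` are hypothetical names.

```python
import pandas as pd

# Made-up relationship values
rel = pd.Series(
    ['Husband', 'Wife', 'Husband', 'Other-relative', 'Husband'],
    name='relationship',
)

# Distinct values, in order of first appearance
categories = rel.unique()
print(categories)

# Number of observations per value, most frequent first
counts = rel.value_counts()
print(counts)

# A single count can be looked up by its label
n_husband = int(counts['Husband'])
print(n_husband)
```

If a label lookup fails on the real data, check the exact strings returned by `.unique()`: the printed `income` output in this notebook suggests the values may carry a leading space.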

In [42]:
# Task 8

### BEGIN SOLUTION

# View the unique values and the number of observations for each
adult['relationship'].unique()
adult['relationship'].value_counts()

adult_other_rel = 981

### END SOLUTION

**Task 8 Test**

In [43]:
# Task 8 - Test

# Hidden tests - you will see the results when you submit to Canvas
### BEGIN HIDDEN TESTS
assert round(adult_other_rel) == 981, 'Did you select the correct observation?'
### END HIDDEN TESTS

**Task 9** - Create a Series

A column of a DataFrame is a pandas Series. Using the `adult` DataFrame, create a Series from the `occupation` column.

* Create a Series from the `occupation` column and name it `adult_occup`
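
A brief sketch of the difference between single and double brackets when selecting a column, on made-up data (`toy`, `as_series`, and `as_frame` are hypothetical names).

```python
import pandas as pd

# Made-up data
toy = pd.DataFrame({'occupation': ['Sales', 'Tech-support'], 'age': [39, 50]})

# Single brackets with one column name return a Series
as_series = toy['occupation']

# Double brackets (a list of names) return a DataFrame
as_frame = toy[['occupation']]

print(type(as_series).__name__, as_series.name)
print(type(as_frame).__name__, as_frame.shape)
```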

In [44]:
# Task 9

### BEGIN SOLUTION
adult_occup = adult['occupation']
### END SOLUTION

**Task 9 Test**

In [45]:
# Task 9 - Test

# These tests are for you to check your work before submitting
assert isinstance(adult_occup, pd.Series), 'Have you created the correct Series?'

# Hidden tests - you will see the results when you submit to Canvas
### BEGIN HIDDEN TESTS
assert adult_occup.name == 'occupation', 'Double check that you have the correct column.'
### END HIDDEN TESTS

**Task 10** - Practice with a new dataset

Let's use some of what we've learned so far to load in a new dataset and answer some questions. For now, we're going to read in the data from the website where it is stored. But, at the end of this project, you can review the instructions for uploading a file to Google Colab.

* Use [this link (right click to copy)](https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/NASCAR/nascard.csv) as the URL and assign it to the variable `data_url2`
* Read in the data into a DataFrame called `nascar`; the CSV already includes a header

More information about this data can be found [here](https://github.com/LambdaSchool/data-science-practice-datasets/tree/main/unit_1/NASCAR).

In [46]:
# Task 10

### BEGIN SOLUTION
data_url2 = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/NASCAR/nascard.csv'
nascar = pd.read_csv(data_url2)
### END SOLUTION

**Task 10 Test**

In [47]:
# Task 10 - Test

# These tests are for you to check your work before submitting
assert isinstance(nascar, pd.DataFrame), 'Have you created the nascar DataFrame?'

# Hidden tests - you will see the results when you submit to Canvas
### BEGIN HIDDEN TESTS
assert nascar.shape == (34884, 10), 'Double check your DataFrame size.'
### END HIDDEN TESTS

**Task 11** - Look at the header

Now, let's look at the data. We'll use the `.head()` method to view the first rows of the DataFrame.

* Assign the output from `.head()` to a DataFrame called `nascar_head` (use the default values of the method)

In [48]:
# Task 11

### BEGIN SOLUTION
nascar_head = nascar.head()
### END SOLUTION

**Task 11 Test**

In [49]:
# Task 11 - Test

# These tests are for you to check your work before submitting
assert isinstance(nascar_head, pd.DataFrame), 'Have you created the nascar_head DataFrame?'

# Hidden tests - you will see the results when you submit to Canvas
### BEGIN HIDDEN TESTS
assert nascar_head.shape == (5, 10), 'Double check your DataFrame size; you should have five rows.'
### END HIDDEN TESTS

**Task 12** - Dataset description and variable types

Using your `nascar` DataFrame, answer the following question. Remember that you can use methods like `describe()` and `info()` to print out information about the variable types in your DataFrame, and `describe(exclude='number')` to view the non-numeric variables.

You can view more information about the data [here](http://users.stat.ufl.edu/~winner/data/nascard.txt).

Select the letter indicating which one of the following statements about the features in `nascar` is incorrect. For example, if statement B is incorrect, you'll type your answer in the code block as `answer = 'B'`


A: Each entry for `driver` is a text string

B: `carMake` is an identifier variable


C: `carsRace` is a quantitative, discrete variable


D: `prize` is a quantitative, continuous variable

In [50]:
# Task 12

# Ignore the YOUR CODE HERE for this cell
### BEGIN SOLUTION
answer = 'B'
### END SOLUTION

In [51]:
# Task 12 - Test

# Hidden tests - you will see the results when you submit to Canvas
### BEGIN HIDDEN TESTS
assert answer == 'B', 'Keep exploring your data.'
### END HIDDEN TESTS

**Extra Practice!!**

Now we're going to practice loading a dataset that you have saved on your computer. Since we're working in a notebook on Google Colab, you'll need to upload your file to the notebook runtime in order to read in the data.

* Use [this link](https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/NASCAR/nascard.csv) to download your dataset as a csv file to your computer. 
* Upload to Google Colab one of two ways:
    * using `from google.colab import files` then `uploaded = files.upload()`; read in the file with `pd.read_csv(filename)`
    * using the file upload feature in the left side panel of the Colab notebook; read in the file with `pd.read_csv(filename)`
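
Reading a local file works the same way as reading from a URL: `pd.read_csv` takes a file path instead. The sketch below avoids the Colab upload step by writing a tiny, made-up CSV to a temporary folder and reading it back. The file name `toy.csv` and the variables `toy`, `local_path`, and `loaded` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for a downloaded file
toy = pd.DataFrame({'driver': ['A', 'B'], 'prize': [100.0, 250.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save the toy data as a CSV with a header row and no index column
    local_path = os.path.join(tmp_dir, 'toy.csv')
    toy.to_csv(local_path, index=False)

    # Reading a local file only requires its path
    loaded = pd.read_csv(local_path)

print(loaded)
```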

More information about this data can be found [here](https://github.com/LambdaSchool/data-science-practice-datasets/tree/main/unit_1/NASCAR).

**DELETE or COMMENT OUT your Colab code**

In [52]:
# Extra Practice

## DELETE OR COMMENT OUT YOUR COLAB CODE BEFORE SUBMITTING THE NOTEBOOK