# Exercises - Classification

**The end product of this exercise is a Jupyter notebook (`classification_exercises.ipynb`) and `acquire.py`. The notebook will contain all your work as you move through the exercises. The `acquire.py` file should contain the final functions.**

In [1]:
import numpy as np
import pandas as pd

import acquire as acq

from debug import local_settings, timeifdebug, timeargsifdebug, frame_splain


In [2]:
local_settings['DEBUG']=True

In [3]:
local_settings['SPLAIN']=True

In [4]:
local_settings

{'DEBUG': True,
 'ARGS': False,
 'KWARGS': False,
 'SPLAIN': True,
 'TOPX': 5,
 'MAXCOLS': 10}
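
The `debug` module imported above is not shown in this notebook. As a rough illustration only, a decorator such as `timeifdebug` could produce the `starting ... / ending ... ; time: ...` lines that appear in later outputs along these lines. The name `timeifdebug_sketch` and the copy of `local_settings` below are assumptions for this sketch, not the module's actual code.

```python
import time
from functools import wraps

# Assumed shape of the settings dict, copied from the output above.
local_settings = {'DEBUG': True, 'ARGS': False, 'KWARGS': False,
                  'SPLAIN': True, 'TOPX': 5, 'MAXCOLS': 10}


def timeifdebug_sketch(fn):
    """Hypothetical stand-in: report start, end and elapsed time when DEBUG is on."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        # With DEBUG off, call straight through without printing anything.
        if not local_settings['DEBUG']:
            return fn(*args, **kwargs)
        print('starting', fn.__name__)
        start = time.time()
        result = fn(*args, **kwargs)
        print('ending', fn.__name__, '; time:', time.time() - start)
        return result
    return wrapper


@timeifdebug_sketch
def add(a, b):
    return a + b


total = add(2, 3)
```

Checking the flag inside `wrapper` rather than at decoration time means that toggling `local_settings['DEBUG']` in a later cell takes effect immediately.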

### In a Jupyter notebook, `classification_exercises.ipynb`:

1. **Use a Python module containing datasets as the source for the `iris` data. Create a pandas dataframe, `df_iris`, from this data.**

In [5]:
df_iris = pd.read_csv('iris.csv')
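
The cell above reads a local `iris.csv`, while the prompt asks for a Python module that ships datasets. One possible alternative, assuming scikit-learn is installed, is `sklearn.datasets.load_iris`. The helper name `iris_from_sklearn` and the column renaming are choices made for this sketch so that the result lines up with the columns used in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris


def iris_from_sklearn():
    """Build an iris dataframe from scikit-learn's bundled copy of the data."""
    bunch = load_iris(as_frame=True)
    df = bunch.frame.copy()
    # scikit-learn names the columns 'sepal length (cm)', ..., 'target';
    # rename them to match the column names used elsewhere in this notebook.
    df.columns = ['sepal_length', 'sepal_width',
                  'petal_length', 'petal_width', 'species']
    # Replace the integer target codes with the species names.
    names = {i: str(name) for i, name in enumerate(bunch.target_names)}
    df['species'] = df['species'].map(names)
    return df


df_iris_sklearn = iris_from_sklearn()
```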

  - print the first 3 rows

In [6]:
df_iris.head(3)

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |


  - print the number of rows and columns (shape)

In [7]:
df_iris.shape

(150, 5)

  - print the column names

In [8]:
frame_splain(df_iris, 'df_iris', splain=True)

DF_IRIS SHAPE:
(150, 5) 

DF_IRIS INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
None 

DF_IRIS DESCRIPTION:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000 

DF_IRIS HEAD:
   sepal_length  sepal_width  petal_

  - print the data type of each column

In [9]:
df_iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

  - print the summary statistics for each of the numeric variables. Would you recommend rescaling the data based on these statistics?

In [10]:
df_iris.describe()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
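
On the rescaling question: the ranges implied by the table above (max minus min) are 3.6, 2.4, 5.9 and 2.4, so the four measurements sit within a factor of about 2.5 of each other. Whether that calls for rescaling depends on the model; distance-based methods are the usual case where it matters. If rescaling is wanted, min-max scaling is one option. The sketch below uses two columns from the three rows printed earlier, and `min_max_scale` is a hypothetical helper, not part of `acquire.py`.

```python
import pandas as pd

# Two columns from the first three iris rows shown earlier in the notebook.
demo = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width': [3.5, 3.0, 3.2],
})


def min_max_scale(df):
    """Rescale each numeric column to [0, 1]; assumes no column is constant."""
    numeric = df.select_dtypes(include='number')
    return (numeric - numeric.min()) / (numeric.max() - numeric.min())


scaled = min_max_scale(demo)
```

A constant column would divide by zero here and come out as `NaN`, so such columns would need separate handling.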


2. **Read `Table1_CustDetails` from the Excel module dataset, `Excel_Exercises.xlsx`, into a dataframe, `df_excel`.**

In [11]:
df_excel = acq.excel_df('Excel_Exercises.xlsx', splain=1)

starting excel_df
starting df_df
DATAFRAME SHAPE:
(7049, 12) 

DATAFRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7049 entries, 0 to 7048
Data columns (total 12 columns):
customer_id          7049 non-null object
gender               7049 non-null object
is_senior_citizen    7049 non-null int64
partner              7049 non-null object
dependents           7049 non-null object
phone_service        7049 non-null int64
internet_service     7049 non-null int64
contract_type        7049 non-null int64
payment_type         7049 non-null object
monthly_charges      7049 non-null float64
total_charges        7038 non-null float64
churn                7049 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 660.9+ KB
None 

ending df_df ; time: 0.008289098739624023
ending excel_df ; time: 2.0617289543151855


  - assign the first 100 rows to a new dataframe, `df_excel_sample`.

In [12]:
df_excel_sample = acq.df_df(df_excel.head(100), splain=1)

starting df_df
DATAFRAME SHAPE:
(100, 12) 

DATAFRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 12 columns):
customer_id          100 non-null object
gender               100 non-null object
is_senior_citizen    100 non-null int64
partner              100 non-null object
dependents           100 non-null object
phone_service        100 non-null int64
internet_service     100 non-null int64
contract_type        100 non-null int64
payment_type         100 non-null object
monthly_charges      100 non-null float64
total_charges        100 non-null float64
churn                100 non-null object
dtypes: float64(2), int64(4), object(6)
memory usage: 9.5+ KB
None 

ending df_df ; time: 0.007348060607910156


  - print the number of rows of your original dataframe

In [13]:
df_excel.shape[0]

7049

  - print the first 5 column names

In [14]:
df_excel.columns[:5]

Index(['customer_id', 'gender', 'is_senior_citizen', 'partner', 'dependents'], dtype='object')

  - print the column names that have a data type of object

In [15]:
print(df_excel.dtypes)
print(df_excel.columns[df_excel.dtypes=='object'])

customer_id           object
gender                object
is_senior_citizen      int64
partner               object
dependents            object
phone_service          int64
internet_service       int64
contract_type          int64
payment_type          object
monthly_charges      float64
total_charges        float64
churn                 object
dtype: object
Index(['customer_id', 'gender', 'partner', 'dependents', 'payment_type',
       'churn'],
      dtype='object')


  - compute the range for each of the numeric variables.

In [16]:
df_excel.describe()

| | is_senior_citizen | phone_service | internet_service | contract_type | monthly_charges | total_charges |
|---|---|---|---|---|---|---|
| count | 7049.0 | 7049.0 | 7049.0 | 7049.0 | 7049.0 | 7038.0 |
| mean | 0.162009 | 1.324585 | 1.222585 | 0.690878 | 64.747014 | 2283.043883 |
| std | 0.368485 | 0.642709 | 0.779068 | 0.833757 | 30.09946 | 2266.521984 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 18.25 | 18.8 |
| 25% | 0.0 | 1.0 | 1.0 | 0.0 | 35.45 | 401.5875 |
| 50% | 0.0 | 1.0 | 1.0 | 0.0 | 70.35 | 1397.1 |
| 75% | 0.0 | 2.0 | 2.0 | 1.0 | 89.85 | 3793.775 |
| max | 1.0 | 2.0 | 2.0 | 2.0 | 118.75 | 8684.8 |
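
`describe()` reports `min` and `max` but not the range itself. A minimal way to compute it is to subtract the column minima from the column maxima over the numeric columns. The two-row frame below is a hypothetical stand-in for `df_excel` that carries only the minimum and maximum charges reported above, and `numeric_ranges` is a name chosen for this sketch.

```python
import pandas as pd

# Two hypothetical rows holding the min and max charges from the table above.
demo = pd.DataFrame({
    'customer_id': ['a', 'b'],
    'monthly_charges': [18.25, 118.75],
    'total_charges': [18.8, 8684.8],
})


def numeric_ranges(df):
    """Return max - min for every numeric column as a Series."""
    numeric = df.select_dtypes(include='number')
    # max() and min() skip NaN by default, so missing values
    # (such as those in total_charges) do not break the result.
    return numeric.max() - numeric.min()


ranges = numeric_ranges(demo)
```

On the real data the call would be `numeric_ranges(df_excel)`.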


3. **Read the data from [this Google sheet](https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit?usp=sharing) into a dataframe, `df_google`.**

In [17]:
google_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'
df_google = acq.google_df(google_url, splain=1)

starting google_df
starting df_df
DATAFRAME SHAPE:
(891, 12) 

DATAFRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None 

ending df_df ; time: 0.005249977111816406
ending google_df ; time: 0.40873217582702637


  - print the first 3 rows

In [18]:
df_google.head(3)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


  - print the number of rows and columns

In [19]:
df_google.shape

(891, 12)

  - print the column names

In [20]:
df_google.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

  - print the data type of each column

In [21]:
df_google.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

  - print the summary statistics for each of the numeric variables

In [22]:
df_google.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


  - print the unique values for each of your categorical variables
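
No cell answers this last item in the notebook. A minimal sketch, assuming "categorical" means non-numeric columns with only a handful of distinct values, could look like the following. The helper `categorical_uniques`, its `max_unique` cutoff, and the three-row demo frame (taken from the rows printed earlier) are illustrative assumptions.

```python
import pandas as pd

# A few columns from the three Titanic rows shown earlier in the notebook.
demo = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Th...',
             'Heikkinen, Miss. Laina'],
    'Sex': ['male', 'female', 'female'],
    'Fare': [7.25, 71.2833, 7.925],
    'Embarked': ['S', 'C', 'S'],
})


def categorical_uniques(df, max_unique=10):
    """Map each low-cardinality non-numeric column to its unique values."""
    result = {}
    for col in df.select_dtypes(exclude='number').columns:
        values = df[col].dropna().unique().tolist()
        # Skip identifier-like columns (names, tickets) with many distinct values.
        if len(values) <= max_unique:
            result[col] = values
    return result


# The cutoff is set to 2 only because the demo frame has three rows.
uniques = categorical_uniques(demo, max_unique=2)
for col, values in uniques.items():
    print(col, values)
```

On the full data, `categorical_uniques(df_google)` with the default cutoff would be the equivalent call. Integer-coded columns such as `Pclass` would need to be added by hand, since the helper looks only at non-numeric columns.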

### In a new Python module, `acquire.py`:

1. **`get_titanic_data`:** returns the titanic data from the Codeup data science database as a pandas data frame.

2. **`get_iris_data`:** returns the data from the `iris_db` on the Codeup data science database as a pandas data frame. The returned data frame should include the actual name of the species in addition to the species_ids.

In [23]:
sql='''select measurements.*, species.species_name from measurements join species using (species_id)'''
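
The join above can be tried without database credentials by running it against an in-memory SQLite stand-in. Everything below is a sketch: the table definitions, the species-id mapping, and the two sample rows are assumptions made for illustration, and the real `get_iris_data` would take its connection from the Codeup database instead.

```python
import sqlite3

import pandas as pd

IRIS_SQL = '''
select measurements.*, species.species_name
from measurements
join species using (species_id)
'''


def get_iris_data(conn):
    """Return the joined measurements and species rows as a dataframe."""
    return pd.read_sql(IRIS_SQL, conn)


# Hypothetical stand-in for iris_db: two tables with a few assumed rows.
conn = sqlite3.connect(':memory:')
conn.executescript('''
create table species (species_id integer primary key, species_name text);
create table measurements (
    measurement_id integer primary key,
    sepal_length real, sepal_width real,
    petal_length real, petal_width real,
    species_id integer
);
insert into species values (1, 'setosa'), (2, 'versicolor'), (3, 'virginica');
insert into measurements values
    (1, 5.1, 3.5, 1.4, 0.2, 1),
    (2, 4.9, 3.0, 1.4, 0.2, 1);
''')

df_demo = get_iris_data(conn)
conn.close()
```

Passing the connection in as an argument keeps the query logic separate from how credentials are obtained, which makes the function easy to exercise against a local stand-in like this one.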

In [24]:
db_url = acq.sql_df(sql, 'iris_db', splain=1)

starting sql_df
starting get_db_url
ending get_db_url ; time: 3.814697265625e-06
starting df_df
DATAFRAME SHAPE:
(150, 7) 

DATAFRAME INFO:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 7 columns):
measurement_id    150 non-null int64
sepal_length      150 non-null float64
sepal_width       150 non-null float64
petal_length      150 non-null float64
petal_width       150 non-null float64
species_id        150 non-null int64
species_name      150 non-null object
dtypes: float64(4), int64(2), object(1)
memory usage: 8.3+ KB
None 

DATAFRAME DESCRIPTION:
       measurement_id  sepal_length  sepal_width  petal_length  petal_width  \
count      150.000000    150.000000   150.000000    150.000000   150.000000   
mean        75.500000      5.843333     3.057333      3.758000     1.199333   
std         43.445368      0.828066     0.435866      1.765298     0.762238   
min          1.000000      4.300000     2.000000      1.000000     0.100000   
2