# IS4487 Module 2 - Practice Code

This notebook is designed to help you follow along with the **Week 2 Lecture and Reading**.

The practice code demos are intended to give you a chance to see working code and can be a source for your lab and assignment work.  Each section contains short explanations and annotated code.

### Topics for this demo:
- Importing data to a dataframe
- Filtering rows in a dataframe
- Reshaping/selecting columns in a dataframe
- Sorting
- Aggregating data

<a href="https://colab.research.google.com/github/vandanara/UofUtah_IS4487/blob/main/Demos/demo_02_dataframe_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

More information:
- Learn more about Colab here:  https://research.google.com/colaboratory/faq.html
- Learn more about Pandas here: https://pandas.pydata.org/docs/user_guide/10min.html


### Context: Motor Trend Car Road Tests
This example uses a small dataset of road tests on 32 cars, published in *Motor Trend* magazine in 1974. It is a classic dataset that statisticians have been using for about 50 years to learn to work with data.

| Column | Description                              |
| ------ | ---------------------------------------- |
| mpg    | Miles per gallon (fuel efficiency)       |
| cyl    | Number of cylinders                      |
| disp   | Displacement (cu. in.)                   |
| hp     | Gross horsepower                         |
| drat   | Rear axle ratio                          |
| wt     | Weight (1000 lbs)                        |
| qsec   | ¼ mile time                              |
| vs     | Engine type (0 = V-shaped, 1 = straight) |
| am     | Transmission (0 = automatic, 1 = manual) |
| gear   | Number of forward gears                  |
| carb   | Number of carburetors                    |


Your task is to import the data into a dataframe and learn to work with it as you would an Excel sheet.

### Import libraries

We will import two libraries:
- Pandas, which is like Excel for Python.  It creates 2-dimensional DataFrames and lets you work with the rows and columns.
- StatsModels, which includes sample datasets for experimenting with Python.

In [None]:
import pandas as pd
import statsmodels.api as sm

### Import Sample Data

Use the data from Lab 1

In [None]:
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
print(df)

### Create Summary Statistics

We will use Pandas functions to preview the data

In [None]:
df.info()

In [None]:
df.describe()

### Work with the DataFrame

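FILTERING: We can filter the dataset to keep only the rows that meet a condition. In this dataset the car names are stored in the row index, so the condition compares against `df.index`.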

In [None]:
#remove the Toyota Corolla row
df2 = df[df.index != 'Toyota Corolla']
print(df2)
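
The same idea works on column values, not only on the index. The sketch below is a minimal illustration, assuming a small hand-typed DataFrame named `sample` (a stand-in for illustration, not the full dataset). Each comparison produces a True/False value per row, and only the True rows are kept.

```python
import pandas as pd

# Small hand-typed sample for illustration (not the full mtcars data)
sample = pd.DataFrame(
    {"mpg": [21.0, 22.8, 33.9, 10.4], "cyl": [6, 4, 4, 8]},
    index=["Mazda RX4", "Datsun 710", "Toyota Corolla", "Cadillac Fleetwood"],
)

# Keep only the rows where cyl equals 4
four_cyl = sample[sample["cyl"] == 4]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
efficient_four_cyl = sample[(sample["cyl"] == 4) & (sample["mpg"] > 30)]

print(four_cyl)
print(efficient_four_cyl)
```

The original `sample` is not changed by either filter; each one returns a new DataFrame.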

RESHAPING: We can select only some columns

In [None]:
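#create a new dataframe with the first two columns
#.copy() makes df3 independent of df2, so sorting it in place later does not raise a warning
df3 = df2[['mpg', 'cyl']].copy()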
print(df3)
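
There are a few other common ways to choose columns. The sketch below assumes a small hand-typed DataFrame named `sample` (for illustration only) and shows the difference between selecting a list of columns, selecting a single column, and dropping a column.

```python
import pandas as pd

# Small hand-typed sample for illustration (not the full mtcars data)
sample = pd.DataFrame(
    {"mpg": [21.0, 22.8], "cyl": [6, 4], "hp": [110, 93]},
    index=["Mazda RX4", "Datsun 710"],
)

# A list of names inside [] returns a DataFrame with just those columns
two_cols = sample[["mpg", "hp"]]

# A single name inside [] returns a Series (one column of values)
mpg_only = sample["mpg"]

# drop(columns=...) keeps every column except the ones named
without_hp = sample.drop(columns=["hp"])

print(two_cols)
print(mpg_only)
print(without_hp)
```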

SORTING the rows

`sort_values` is a method, which is accessed using the dot operator `.` after the DataFrame name. Each method accepts one or more parameters, which are options that customize the method's behavior.

The `by` parameter specifies one or more columns to sort on. Place the column names inside the `[]` of the `by` parameter, in quotes (`' '` or `" "`).

The `inplace` parameter controls whether the sorted result replaces the data in the DataFrame. With `inplace=True` the DataFrame itself is reordered and nothing is returned. With the default, `inplace=False`, the original is left unchanged and a sorted copy is returned.

To see all the parameters available, type the opening parenthesis after the method name, and Colab will automatically bring up the documentation. Super helpful!

In [None]:
#sort the rows by mpg
df3.sort_values(by=['mpg'], inplace=True)
print(df3)
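
Two other parameters are worth trying. The sketch below assumes a small hand-typed DataFrame named `sample` (for illustration only). It sorts in descending order with `ascending=False`, sorts on two columns at once, and assigns the result to a new name rather than sorting in place.

```python
import pandas as pd

# Small hand-typed sample for illustration (not the full mtcars data)
sample = pd.DataFrame(
    {"mpg": [21.0, 22.8, 33.9, 10.4], "cyl": [6, 4, 4, 8]},
    index=["Mazda RX4", "Datsun 710", "Toyota Corolla", "Cadillac Fleetwood"],
)

# Without inplace=True, sort_values returns a new sorted DataFrame
# and leaves the original unchanged
by_mpg_desc = sample.sort_values(by=["mpg"], ascending=False)

# Sort by cyl first, then by mpg within each cyl value
by_cyl_then_mpg = sample.sort_values(by=["cyl", "mpg"])

print(by_mpg_desc)
print(by_cyl_then_mpg)
```

Assigning the result to a new name keeps the unsorted data available, which can make a notebook easier to rerun cell by cell.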

#### AGGREGATIONS

We can perform groupby aggregations when we wish to perform some calculation on data that is grouped by specified columns.

The command to group by columns is `DF.groupby('colname')`, or `DF.groupby(['col1', 'col2'])` if there are multiple columns.

Calling `.size()` on the grouped result will give us the number of rows in each group.

The output of `.size()` is a Series, which is like a table with a single column. It has a row index, which is the number of cylinders, and one unnamed column of values, which contains the number of rows (cars) with that number of cylinders. If it were converted to a DataFrame without a name, that column would be called `0`, which is meaningless.

We can change this to a new DataFrame by calling `.reset_index(name='count')`, which will do the following:
- add a new index 0, 1, 2, ...
- move the number of cylinders into a regular column called `cyl`
- name the column of counts 'count'. You can specify any name you want to use.

In [None]:
#aggregate the data to get the number of cars with each cylinder count
df4 = df3.groupby('cyl').size().reset_index(name='count')
print(df4)
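
Counting rows is only one kind of aggregation. The sketch below assumes a small hand-typed DataFrame named `sample` (for illustration only) and computes the average mpg per cylinder count, then several summaries at once with `agg`. The output column names `mean_mpg`, `n_cars`, and `max_mpg` are arbitrary choices.

```python
import pandas as pd

# Small hand-typed sample for illustration (not the full mtcars data)
sample = pd.DataFrame(
    {"mpg": [21.0, 21.4, 22.8, 32.4, 10.4], "cyl": [6, 6, 4, 4, 8]},
    index=["Mazda RX4", "Hornet 4 Drive", "Datsun 710", "Fiat 128",
           "Cadillac Fleetwood"],
)

# Average mpg within each cyl group, returned as a regular DataFrame
mean_mpg = sample.groupby("cyl")["mpg"].mean().reset_index(name="mean_mpg")

# Several aggregations at once: new_column=(source_column, function_name)
summary = (
    sample.groupby("cyl")
    .agg(n_cars=("mpg", "count"), max_mpg=("mpg", "max"))
    .reset_index()
)

print(mean_mpg)
print(summary)
```

Note that `"count"` counts non-missing values in the chosen column, while `.size()` counts rows. The two agree here because the sample has no missing values.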