<a href="https://colab.research.google.com/github/pallavrouth/AI-Bootcamp/blob/main/Labs/Week3lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data information

Panel data on cigarette consumption for the 48 continental US States from 1985–1995.

A data frame containing 48 observations on 7 variables for 2 periods.
- `state`: Factor indicating state.
- `year`: Factor indicating year.
- `cpi`: Consumer price index.
- `population`: State population.
- `packs`: Number of packs per capita.
- `income`: State personal income (total, nominal).
- `tax`: Average state, federal and average local excise taxes for fiscal year. 
- `price`: Average price during fiscal year, including sales tax.
- `taxs`: Average excise taxes for fiscal year, including sales tax.

Information on variables: https://cran.r-project.org/web/packages/AER/AER.pdf (see the `CigarettesSW` entry)

Link to data : https://raw.githubusercontent.com/pallavrouth/AI-Bootcamp/main/Data/cigarettes.csv


## Pandas

1. Import the data set stored in CSV format
1. Pandas dataframe versus pandas series
2. Selecting (column wise operations)
    1. Select multiple columns 
    2. Select range of columns 
    3. Select multiple ranges of columns
    4. Dropping columns
3. Filter (row wise operations)
    1. Based on one condition
    2. Based on multiple conditions
    3. Filter based on numeric versus categorical levels 
4. Mutate (column wise operations) - adding columns to existing data
5. Arrange (row wise operations) - sorting rows based on a variable
6. Summarize (column and row) - summarize groups in data
    1. Simple group operations
    2. Single group by operations
    3. Multiple group by operations on single column
    4. Multiple group by operations on multiple columns
7. Other tasks - rename columns, drop columns

## Plotting

1. Distribution of one (continuous) variable - histogram
2. Distribution of categorical variables - barplot/boxplot
3. Distribution of 2 variables - scatter plot/ line plot
4. Plot groups of variables by color/shape/facets

# Monday

## 1. Import the data frame

In [None]:
import pandas as pd

In [None]:
my_data = pd.read_csv('https://raw.githubusercontent.com/pallavrouth/AI-Bootcamp/main/Data/cigarettes.csv',
                      index_col = 0)
my_data.head()

## 2. Series versus dataframe

In general, indexing with `[]` gives you a series and indexing with `[[]]` gives you a dataframe. I like using the latter because it makes the data easier to manipulate. A series is just the values under one column, so a dataframe can be thought of as a collection of series.

In [None]:
my_data['income'].head()
my_data['income']
my_data[['income','tax','packs']][['income']]
type(my_data[['income','tax','packs']][['income']])
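To make the difference concrete, here is a minimal sketch on a small toy dataframe. The name `toy` and its values are made up for illustration and are not the cigarettes data.

```python
import pandas as pd

# Toy data with made-up values (an assumption, not the real cigarettes data)
toy = pd.DataFrame({
    "state": ["AL", "CT", "MA"],
    "income": [100.0, 200.0, 300.0],
    "tax": [30.0, 45.0, 50.0],
})

as_series = toy["income"]    # single brackets give a Series
as_frame = toy[["income"]]   # double brackets give a one-column DataFrame

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(as_series.shape, as_frame.shape)   # (3,) (3, 1)
```

The shapes show the distinction: a series is one-dimensional, while the one-column dataframe still has rows and columns.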

## 3. Selecting columns

Below are different ways to select a column from the data.

In [None]:
my_data['income'].head()
my_data[['income']].head()
my_data.income.head()
my_data.iloc[:,[5]].head()
my_data.loc[:,['income']].head()
my_data.filter(regex = 'income').head()

I like to use the `loc` syntax. Here is how to use `loc` to select multiple columns:

In [None]:
my_data.loc[:,['income','tax','price']].head()

How to use `loc` to select a range of columns using the `:` operator. Note: when using a colon, do not wrap the range in `[]`.

In [None]:
# using : to select a single range of columns
my_data.loc[:,'income':'price'].head()

# using : to select multiple ranges of columns
list(my_data.loc[:,'income':'price'])
list(my_data.loc[:,'year':'population'])
col_names = list(my_data.loc[:,'income':'price']) + list(my_data.loc[:,'year':'population'])
my_data.loc[:,col_names].head()
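The outline also lists dropping and renaming columns. A minimal sketch using the `drop` and `rename` methods on a toy dataframe follows; `toy` and the new name `avg_price` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy data with made-up values (an assumption, not the real cigarettes data)
toy = pd.DataFrame({
    "state": ["AL", "CT"],
    "year": [1985, 1985],
    "price": [100.0, 120.0],
    "tax": [30.0, 45.0],
})

# Drop one or more columns by name; the original dataframe is left unchanged
dropped = toy.drop(columns=["tax"])

# Rename columns with an {old_name: new_name} dictionary
renamed = toy.rename(columns={"price": "avg_price"})

print(list(dropped.columns))
print(list(renamed.columns))
```

Both methods return a new dataframe, so assign the result to a name if you want to keep it.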

## 4. Filtering rows

Different ways to filter rows

In [None]:
my_data[my_data.tax > 40].head()
my_data[my_data['tax'] > 40].head()
my_data.loc[my_data.tax > 40,:].head()
my_data.loc[my_data['tax'] > 40,:].head()

General syntax : dataframe.column_name or dataframe['column_name']

How to filter on a column with non-numeric data: use the `isin` method.

In [None]:
my_data.loc[my_data['state'].isin(['CT','MA','AL']),:].head()
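Filtering on multiple conditions works the same way, combining conditions with `&` (and) or `|` (or). The sketch below uses a toy dataframe with made-up values; the key detail is that each condition needs its own parentheses.

```python
import pandas as pd

# Toy data with made-up values (an assumption, not the real cigarettes data)
toy = pd.DataFrame({
    "state": ["AL", "CT", "MA", "AL"],
    "year": [1985, 1985, 1995, 1995],
    "tax": [30.0, 45.0, 50.0, 42.0],
})

# AND: both conditions must hold
high_tax_1995 = toy.loc[(toy["tax"] > 40) & (toy["year"] == 1995), :]

# OR: at least one condition must hold
either = toy.loc[(toy["tax"] > 40) | (toy["state"] == "AL"), :]

# Mixing a categorical filter (isin) with a numeric one
combo = toy.loc[toy["state"].isin(["CT", "MA"]) & (toy["tax"] > 40), :]

print(high_tax_1995)
print(either)
print(combo)
```

Python's `and` / `or` keywords do not work element-wise on columns, which is why `&` and `|` are used here.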

## 4. Mutating

How to add a column to the data

In [None]:
my_data['ratio1'] = my_data['price'] / my_data['tax']
my_data['ratio2'] = my_data['population'] / my_data['packs']
my_data

However, the above method cannot be 'chained'. To chain operations, use the `assign` function. It also feels more natural and lets you add many columns at once. To refer to columns of the dataframe inside `assign`, pass a `lambda` that receives the dataframe.

In [None]:
my_data.assign(ratio1 = lambda x: x['price'] / x['tax'],
               ratio2 = lambda x: x['population'] / x['packs'])
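To show what chaining looks like in practice, here is a small sketch that adds a column and then filters on it in a single expression. The dataframe `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy data with made-up values (an assumption, not the real cigarettes data)
toy = pd.DataFrame({
    "state": ["AL", "CT", "MA"],
    "price": [100.0, 150.0, 120.0],
    "tax": [25.0, 50.0, 30.0],
})

result = (
    toy
    .assign(ratio1=lambda x: x["price"] / x["tax"])   # add the new column
    .loc[lambda x: x["ratio1"] > 3.5, :]              # filter on the new column
)

print(result)
print("ratio1" in toy.columns)   # assign returned a copy, toy is unchanged
```

The `lambda` is what lets the filter step see the `ratio1` column that was created one step earlier in the chain.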

## 5. Arranging rows

We can sort or arrange rows using the `sort_values` function.

In [None]:
my_data.sort_values('income').head()
my_data.sort_values(['income','tax']).head()
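By default `sort_values` sorts in ascending order. The sketch below, on a toy dataframe with made-up values, shows the `ascending` argument for descending and mixed-direction sorts.

```python
import pandas as pd

# Toy data with made-up values (an assumption, not the real cigarettes data)
toy = pd.DataFrame({
    "state": ["AL", "CT", "MA", "TX"],
    "income": [200, 100, 200, 150],
    "tax": [30, 45, 50, 20],
})

# Largest income first
by_income_desc = toy.sort_values("income", ascending=False)

# Income descending, then tax ascending to break ties
mixed = toy.sort_values(["income", "tax"], ascending=[False, True])

print(by_income_desc)
print(mixed)
```

When sorting by several columns, `ascending` takes one True/False value per column, in the same order.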

## 6. Summarize groups

Sometimes we want to group by certain groups in the data and summarise the groups.

Simple group by operations such as `value_counts()` can be done without much fuss.

In [None]:
my_data['state'].value_counts()

More complex group by operations require using the `groupby` function.

In [None]:
my_data.groupby('state')[['price']].mean().head()
my_data.groupby('state')[['price','tax']].mean().head()

Group by and aggregate in multiple ways. For example, I may want to group by and find both the average as well as the minimum. The `agg` function is really useful here.

In [None]:
my_data.groupby('state')[['price']].agg(['mean','min'])

I may want to group by and use different aggregation functions on different columns. Again, use the `agg` function, but supply a dictionary rather than a list.

In [None]:
my_data.groupby('state')[['price','tax']].agg({'price':'mean','tax':'min'})
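A related option is pandas' named aggregation, where each output column gets its own name. This is a sketch on a toy dataframe with made-up values; the output names `avg_price` and `min_tax` are hypothetical.

```python
import pandas as pd

# Toy data with made-up values (an assumption, not the real cigarettes data)
toy = pd.DataFrame({
    "state": ["AL", "AL", "CT", "CT"],
    "price": [100.0, 120.0, 150.0, 170.0],
    "tax": [30.0, 40.0, 45.0, 55.0],
})

# new_column_name = (source_column, aggregation_function)
summary = toy.groupby("state").agg(
    avg_price=("price", "mean"),
    min_tax=("tax", "min"),
)

print(summary)
```

The result has one row per state and flat, readable column names, which can be easier to work with than the dictionary form.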

If no built-in function does what you need, use the `apply` function. To be continued on Friday.

# Friday

Group by state, calculate the average price and the average tax, and then find their ratio.

In [None]:
my_data.groupby('state').apply(lambda x : x['price'].mean() / x['tax'].mean() )
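For this particular calculation there is also a route that avoids `apply`: compute the group means first and then divide the two resulting columns. The sketch below runs on a toy dataframe with made-up values.

```python
import pandas as pd

# Toy data with made-up values (an assumption, not the real cigarettes data)
toy = pd.DataFrame({
    "state": ["AL", "AL", "CT", "CT"],
    "price": [100.0, 120.0, 150.0, 170.0],
    "tax": [30.0, 40.0, 45.0, 55.0],
})

# Step 1: average price and average tax per state
means = toy.groupby("state")[["price", "tax"]].mean()

# Step 2: ratio of the two averages, one value per state
ratio = means["price"] / means["tax"]

print(ratio)
```

Splitting the work into two steps also makes it easy to inspect the intermediate averages.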

Matt's homework

1. Get the data in

In [None]:
import pandas as pd
import requests

download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = "nba_all_elo.csv"

response = requests.get(download_url)
response.raise_for_status()    # Check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)

nba = pd.read_csv("nba_all_elo.csv")

1. What is the average for points scored in wins and losses?

Strategy 

Step 1 : First you must FILTER the rows in the data that correspond to wins (the same filter with `"L"` gives the losses)

In [None]:
wins = nba[(nba['game_result'] == "W")] 

Step 2 : Now extract the column that stores the 'points' information and then find the mean of this column

In [None]:
wins[['pts']].mean()

Note: There are other ways of writing the above code 

In [None]:
wins.pts.mean()
wins['pts'].mean()
wins.loc[:,['pts']].mean()

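They all give you the same value. The versions with `[['pts']]` return it inside a one-element series, while `wins.pts.mean()` and `wins['pts'].mean()` return a plain number.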

There is another strategy to get the answer. 

Step 1: You group by the win category and loss category in the data

In [None]:
group_data = nba.groupby(['game_result'])

Step 2: You take these groups, extract the column for points, and then find its mean

In [None]:
group_data[['pts']].mean()

2. How have the average and median points scored by winning and losing teams changed by decade?

My solution follows this strategy

Step 1: I create a new column in the data that says which decade a year belongs to. The function below does that for a single year.

In [None]:
def get_decade(year):
  # Each branch covers ten calendar years, e.g. 1960 through 1969.
  # Years before 1950 are labelled '1940s' here so they do not fall
  # through to the final '2010s' branch.
  if year < 1950:
    return '1940s'
  elif year < 1960:
    return '1950s'
  elif year < 1970:
    return '1960s'
  elif year < 1980:
    return '1970s'
  elif year < 1990:
    return '1980s'
  elif year < 2000:
    return '1990s'
  elif year < 2010:
    return '2000s'
  else:
    return '2010s'

In [None]:
# test the function
get_decade(1966)

Step 2 : Now apply this function to the year column in the data and add the result to the dataset

In [None]:
nba['decade'] = nba['year_id'].apply(get_decade)
nba.columns
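As an alternative to applying a Python function row by row, the decade label can be derived with integer arithmetic on the whole column. This is a sketch on a toy `year_id` column with made-up years, not the NBA data. It labels by calendar decade, so 1960 maps to '1960s'.

```python
import pandas as pd

# Toy years (made-up values, an assumption for illustration)
toy_nba = pd.DataFrame({"year_id": [1947, 1960, 1966, 2015]})

# Integer-divide by 10 and multiply back to get the first year of the decade
decade_start = toy_nba["year_id"] // 10 * 10

# Turn 1960 into the label '1960s'
toy_nba["decade"] = decade_start.astype(str) + "s"

print(toy_nba)
```

This works for any year without listing each decade by hand.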

Step 3 : Now group by decade and calculate the average points for winning teams

In [None]:
nba[(nba['game_result'] == "W")].groupby('decade')[['pts']].mean()
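To cover both wins and losses and both statistics in one go, one option is to group by decade and game result together and pass a list to `agg`. The sketch below uses a toy dataframe with made-up points; only the column names `decade`, `game_result` and `pts` are taken from the steps above.

```python
import pandas as pd

# Toy data with made-up values (an assumption, not the real NBA data)
toy_nba = pd.DataFrame({
    "decade": ["1950s", "1950s", "1950s", "1950s", "1960s", "1960s"],
    "game_result": ["W", "W", "L", "L", "W", "L"],
    "pts": [100, 110, 90, 80, 120, 100],
})

# One row per (decade, game_result) pair, one column per statistic
summary = (
    toy_nba
    .groupby(["decade", "game_result"])["pts"]
    .agg(["mean", "median"])
)

print(summary)
```

Reading down the rows for "W" and then for "L" shows how each statistic moves from one decade to the next.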