# All material ©2019, Alex Siegman

---

# Descriptive Analysis in Python

##  Let's read our CSV back in again: 

In [None]:
import pandas as pd # importing the Pandas library

df = pd.read_csv('./SternTech_UserData.csv',encoding='utf-8') # read in the csv

In [None]:
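pd.options.display.max_rows = 2000 # show up to 2000 rows before truncating the output

pd.options.display.max_columns = 50 # show up to 50 columns before truncating the output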

In [None]:
df.head() 

In [None]:
df = df.drop(df.columns[[0]],axis=1) # let's get rid of our unnamed column 

In [None]:
df.head()

## Group By

In [None]:
age = df.groupby('age')
age.describe()

In [None]:
age = df.groupby(['age','clicked_on_ad'])
age.describe()
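
`describe()` returns every summary statistic for every numeric column, which gets wide quickly. As a minimal sketch, here's how you might ask for just a couple of statistics per group instead. The toy data below is made up for illustration: the `time_on_site` column is hypothetical, and only the `age` and `clicked_on_ad` names are borrowed from our dataset.

```python
import pandas as pd

# hypothetical toy data; only the 'age' and 'clicked_on_ad' names mirror our dataset
toy = pd.DataFrame({
    'age': [25, 25, 30, 30, 30],
    'clicked_on_ad': ['Yes', 'No', 'Yes', 'Yes', 'No'],
    'time_on_site': [10.0, 20.0, 30.0, 50.0, 40.0],  # made-up column
})

# row count and average time on site for each age / clicked_on_ad combination
grouped = toy.groupby(['age', 'clicked_on_ad'])['time_on_site'].agg(['count', 'mean'])
grouped
```

Grouping by two columns gives a two-level index, so a single group is looked up with a tuple, e.g. `grouped.loc[(30, 'Yes')]`.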

---

## Working With Time...

#### https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

In [None]:
import datetime # import the datetime library 

### There are three major ways you can deal with time in Pandas: 

1. Timestamps (DatetimeIndex, datetime64)
2. Periods / Timespans (PeriodIndex, period[freq]) 
3. Timedeltas (TimedeltaIndex, timedelta64)

In [None]:
# timestamps 

timestamps = pd.Timestamp(datetime.datetime(2019, 5, 9))
timestamps

In [None]:
# periods

periods = pd.Series(pd.period_range('1/1/2019', freq='M', periods=3))
periods
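
The third item on the list, timedeltas, represents a duration rather than a point in time. A minimal sketch, using arbitrary dates chosen only for illustration:

```python
import pandas as pd

# timedeltas: a duration, i.e. the difference between two points in time
start = pd.Timestamp('2019-05-09')
end = pd.Timestamp('2019-05-16 12:00:00')

delta = end - start  # subtracting two timestamps gives a Timedelta

one_week_later = start + pd.Timedelta(days=7)  # adding a Timedelta gives a new Timestamp

delta, one_week_later
```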

### It's important to note that both timestamps and periods can serve as an index. 

#### Remember, an index is a way to access a data point in a data frame. 
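
To see why a timestamp index is handy, here is a small sketch on a made-up frame (the `visits` column and its values are hypothetical). Once the index holds timestamps, `.loc` accepts a partial date string and returns every row that falls inside it:

```python
import pandas as pd

# hypothetical toy data with timestamps as the index
toy = pd.DataFrame(
    {'visits': [3, 5, 2, 8]},
    index=pd.to_datetime(['2019-01-15', '2019-01-20', '2019-02-03', '2019-03-10']),
)

january = toy.loc['2019-01']  # every row whose timestamp falls in January 2019
january
```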

In [None]:
df.reset_index().head() # preview the data with a fresh index (this returns a copy, so df itself is unchanged)

In [None]:
df.head() # check the first five rows of our dataset

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'],format='%Y-%m-%d %H:%M:%S') # convert to datetime 

In [None]:
type(df['timestamp'][0]) # make sure we've properly converted from str to timestamp

In [None]:
df = df.set_index('timestamp') # set our index to the timestamp column

In [None]:
df.head() # check the first five rows of our dataframe 

In [None]:
import numpy as np

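periods = df.resample('M').mean(numeric_only=True) # let's group by month; numeric_only skips text columns, which newer versions of pandas refuse to average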
periods
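
If it helps to see what a monthly average does on something small, here is a sketch on made-up numbers. Note that it swaps in a different technique from the cell above: it groups by the month of each timestamp with `to_period('M')` instead of calling `resample`. For months that contain data, the averages should match.

```python
import pandas as pd

# hypothetical toy data: two rows in January, two in February
toy = pd.DataFrame(
    {'age': [20, 30, 40, 60]},
    index=pd.to_datetime(['2019-01-05', '2019-01-25', '2019-02-10', '2019-02-20']),
)

# convert each timestamp to its monthly period, then average within each month
monthly = toy.groupby(toy.index.to_period('M')).mean()
monthly
```

One difference to keep in mind: `resample` also emits a row for months with no data at all (filled with `NaN`), while this group-by version only returns months that actually appear in the index.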

In [None]:
start_date = '2000-10-31' # set an arbitrary start date
end_date = '2001-06-30' # set an arbitrary end date

mask = (df.index > start_date) & (df.index <= end_date) # create our mask

In [None]:
df = df.loc[mask]
df # check our mask results
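
A mask is just an array of `True` / `False` values, one per row, and `.loc` keeps the rows marked `True`. Here is the same idea as a sketch on a tiny made-up frame (the `visits` values are hypothetical), so you can see exactly which rows survive. Because the first comparison uses `>` and the second uses `<=`, the start date is excluded and the end date is included.

```python
import pandas as pd

# hypothetical toy data straddling the date range
toy = pd.DataFrame(
    {'visits': [1, 2, 3, 4]},
    index=pd.to_datetime(['2000-10-31', '2000-11-15', '2001-06-30', '2001-07-01']),
)

start_date = '2000-10-31'
end_date = '2001-06-30'

# True only for rows strictly after start_date and on or before end_date
mask = (toy.index > start_date) & (toy.index <= end_date)

filtered = toy.loc[mask]  # keep the rows where the mask is True
filtered
```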

## Dummy Variables:

Often you will have Yes / No values, or other categorical variables, that you need to transform into numerical data. The easiest way to do that is with dummy variables: one new column per category, marking whether each row belongs to that category.
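
As a minimal sketch of what that transformation looks like, here is `pd.get_dummies` on a three-row toy column. The Yes / No values are assumed for illustration. This version uses the `prefix` argument, so the new column names contain an underscore, and `dtype=int` so the output is 0 / 1.

```python
import pandas as pd

# hypothetical toy column; the Yes / No values are assumed for illustration
toy = pd.DataFrame({'clicked_on_ad': ['Yes', 'No', 'Yes']})

# one new 0 / 1 column per distinct value in the original column
dummies = pd.get_dummies(toy['clicked_on_ad'], prefix='clicked_on_ad', dtype=int)
dummies
```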

In [None]:
df.head()

In [None]:
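# swap the clicked_on_ad column for one dummy column per value, each named 'clicked_on_ad' + the value
df = pd.concat([df.drop('clicked_on_ad', axis=1), pd.get_dummies(df['clicked_on_ad']).rename(columns=lambda x: 'clicked_on_ad' + str(x))], axis=1)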

In [None]:
df.head()

## Last but not least, a bit of Data Viz:

_P.S. We'll delve much further into data viz later on in this course, so don't worry if this appears too basic_

In [None]:
df.plot() # note that it's only going to plot the numerical data, and it isn't very helpful...

In [None]:
# same trick for location: one dummy column per value, each named 'location' + the value
df = pd.concat([df.drop('location', axis=1), pd.get_dummies(df['location']).rename(columns=lambda x: 'location' + str(x))], axis=1)

In [None]:
df.head()

In [None]:
df.plot() # not very helpful, but easy!

In [None]:
df.groupby('locationNorthEast')['age'].mean().plot(kind='bar')

In [None]:
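df.groupby('sex')['locationNorthEast'].count().plot(kind='bar') # note: count() tallies every non-null row per sex, whatever the dummy's value; use sum() to count only the NorthEast rows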

In [None]:
df[['age']].plot(kind='hist',bins=[0,10,20,30,40,50,60,70,80,90],rwidth=.5)

---