# All material ©2019, Alex Siegman

---

# Descriptive Analysis in Python

## What is Pandas? 

Pandas (https://pandas.pydata.org/) is an open source library that allows you to easily work with and analyze structured data in Python. 

## Why is Pandas Useful? 

Let's think about Stern Technologies, the fictitious organization around which this entire course is based. Remember, they are an AdTech organization {tktktk}. 

Imagine you have just been hired at Stern Technologies, and you have been asked to genreate a basic report on Stern Tech users.  

In [301]:
import pandas as pd # importing the Pandas library

#####  For a full list of all the possible Pandas operations:  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

### In order to upload our CSV (Commas Separated Values) into our Jupyter Notebook, we need to point our machine into the right folder, so to speak. We can use Command Line commands to do so. For more on the Command Line, check out "Unix 101" in the GitHub repo. 

##### Because Stern Technologies isn't real, we're using a dataset that I (Alex) manufactured. Remember, NONE OF THESE CUSTOMERS ARE REAL. 

In [302]:
!pwd # AKA, 'Print Working Directory' – tells us what folder I am in right now.

# think of this like using your mouse to click into and out of folders on your desktop. 
# this is just bypassing that UI.

# the '!' allows you to execute a shell command. Basically, you're working as if you would in 
# your terminal, but from the Jupyter Notebook.

/Users/siegmanA/Desktop/NYU-Projects-in-Programming-Fall-2019/(Class 3) Descriptive Data Analysis


In [303]:
ls # list all of the files in the current directory (remember, directory = folder in UI world)..

Descriptive Analysis in Python.ipynb  SternTech_UserData.csv


### Now that I'm in the right place, I can 'read' my CSV using the following command:

In [304]:
df = pd.read_csv('./SternTech_UserData.csv',encoding='utf-8') # read in the csv

# we are setting our dataset equal to the value 'df'.
# we can name this anything at all, it doesn't matter.
# df is commonplace, though, and stands for 'data frame'.

# you can ignore the 'encoding' piece for now, we'll get to that later on when we talk about web scraping. 

### Let's begin with a primary, exploratory analysis of our data...

In [305]:
pd.options.display.max_rows = 2000 # the way Jupyter Notebook tends to display the results of such queries isn't 
                                   # always helpful, but we can very easily change that.
                                   # this will ensure we can view up to 2,000 rows without seeing elipses in the UI
    
pd.options.display.max_columns = 50 # try commenting out this last line ('max_columns =50') then run the cell below
                                    # to see the difference this formatting makes 

In [306]:
# let's drop that "unnamed" column 

df = df.drop(df.columns[[0]],axis=1)

In [307]:
df.head() # this gets the first five rows of data in your data frame 
          # df.tail() will give you the last five rows
          # if you want, you can choose any number - df.head(15) would give you the first 15 rows, for instance

Unnamed: 0,id,company_size,age,sex,clicked_on_ad,ad_type,location,timestamp
0,081217b4-1cf5-4657-8287-6db1b75462e4,large,92,M,Yes,Business,MidWest,2018-08-26 06:00:27.124290
1,d0b45a01-b73d-4f8e-bfa8-c53ea75397f1,large,56,M,Yes,Culinary,SouthWest,2011-06-01 18:54:34.815634
2,1dc2e636-e19b-4d42-b228-df09cd009acb,large,20,F,No,Business,SouthEast,2013-07-16 00:24:47.888180
3,5d09d6d4-023e-4fa1-9559-89526679e885,large,55,F,Yes,Political,NorthWest,2010-06-25 12:13:51.369878
4,b69e54e3-fc89-4c0f-8bdb-280409db173e,medium,25,N,No,Tech,US,2010-09-22 07:53:12.454909


In [308]:
list(df) # get a list of all the column names for your data frame

# we'll discuss this later on, but note that a list is comprised of comma-separated values inside of square brackets
# you can also use "df.columns" if you prefer, which will give you a similar output

['id',
 'company_size',
 'age',
 'sex',
 'clicked_on_ad',
 'ad_type',
 'location',
 'timestamp']

In [309]:
df.describe() # get the basic statitical metrics for a data frame

Unnamed: 0,age
count,50000.0
mean,58.4073
std,23.679151
min,18.0
25%,38.0
50%,58.0
75%,79.0
max,99.0


### Group By

In [310]:
age = df.groupby('age')
age.describe()

Unnamed: 0_level_0,id,id,id,id,company_size,company_size,company_size,company_size,sex,sex,sex,sex,clicked_on_ad,clicked_on_ad,clicked_on_ad,clicked_on_ad,ad_type,ad_type,ad_type,ad_type,location,location,location,location,timestamp,timestamp,timestamp,timestamp
Unnamed: 0_level_1,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq
age,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2,Unnamed: 26_level_2,Unnamed: 27_level_2,Unnamed: 28_level_2
18,649,649,6719c6ab-79b1-4d90-b591-3a564205b6aa,1,649,4,medium,171,649,3,N,236,649,2,Yes,344,649,8,Tech,92,649,9,NorthWest,85,649,649,2001-09-24 13:30:25.039777,1
19,592,592,19b53fe5-59e4-49a0-bd46-99c2dd062dd0,1,592,4,small,158,592,3,N,207,592,2,Yes,302,592,8,Business,82,592,9,NorthEast,83,592,592,2019-06-25 11:25:18.501965,1
20,601,601,21a5d20d-1176-4b51-be59-927402aee277,1,601,4,large,168,601,3,N,217,601,2,No,301,601,8,Real Estate,92,601,9,Mexico,81,601,601,2010-11-28 22:41:05.579943,1
21,647,647,99d03850-2774-46c9-b2a8-fabbde2e22e0,1,647,4,large,181,647,3,F,224,647,2,No,337,647,8,Political,93,647,9,NorthWest,79,647,647,2011-01-05 03:05:24.791586,1
22,608,608,f557b1a1-cf6b-4859-85f1-ddede4de47d0,1,608,4,medium,172,608,3,M,209,608,2,Yes,328,608,8,Travel,94,608,9,SouthAmerica,75,608,608,2018-11-02 15:35:07.250189,1
23,579,579,0d6a3698-7400-492e-b04e-083d3df13626,1,579,4,medium,155,579,3,F,203,579,2,No,300,579,8,Business,90,579,9,SouthWest,81,579,579,2001-01-07 01:58:56.140579,1
24,632,632,2222a17e-7d51-40c4-858c-6936be64adc2,1,632,4,large,161,632,3,N,217,632,2,Yes,320,632,8,Business,93,632,9,Canada,87,632,632,2007-06-04 11:28:43.555889,1
25,665,665,c5e55ed7-af50-451d-979d-82a2f6334dab,1,665,4,startup,172,665,3,N,238,665,2,Yes,339,665,8,Travel,95,665,9,Canada,95,665,665,2010-09-18 08:34:16.785605,1
26,583,583,fd6b6d9b-5ea0-4aeb-90a8-bd2f3457a17a,1,583,4,startup,152,583,3,F,210,583,2,Yes,300,583,8,Political,85,583,9,US,82,583,583,2005-08-08 21:46:33.013406,1
27,621,621,2218258f-f75e-4ff7-b0e1-6670ce6446e0,1,621,4,medium,160,621,3,N,230,621,2,Yes,319,621,8,Real Estate,97,621,9,SouthWest,80,621,621,2017-08-11 23:43:42.777812,1


In [311]:
age = df.groupby(['age','clicked_on_ad'])
age.describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,id,id,id,id,company_size,company_size,company_size,company_size,sex,sex,sex,sex,ad_type,ad_type,ad_type,ad_type,location,location,location,location,timestamp,timestamp,timestamp,timestamp
Unnamed: 0_level_1,Unnamed: 1_level_1,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq,count,unique,top,freq
age,clicked_on_ad,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2,Unnamed: 24_level_2,Unnamed: 25_level_2
18,No,305,305,c48ce2b9-4834-44f5-a193-8cd13dbb1f3a,1,305,4,large,81,305,3,F,107,305,8,Political,45,305,9,SouthWest,40,305,305,2000-11-13 10:11:28.444173,1
18,Yes,344,344,cde6dbac-6bbb-4027-a69b-c874b75f612d,1,344,4,medium,91,344,3,N,131,344,8,Luxury,50,344,9,NorthWest,47,344,344,2001-09-24 13:30:25.039777,1
19,No,290,290,19b53fe5-59e4-49a0-bd46-99c2dd062dd0,1,290,4,small,79,290,3,F,103,290,8,Luxury,48,290,9,NorthEast,43,290,290,2019-06-25 11:25:18.501965,1
19,Yes,302,302,95740270-c08e-4144-951e-427712258f44,1,302,4,startup,81,302,3,N,106,302,8,Political,44,302,9,NorthWest,41,302,302,2013-10-22 09:44:54.985696,1
20,No,301,301,80485082-35ef-48be-996b-2e33335a7a64,1,301,4,large,91,301,3,N,112,301,8,Business,47,301,9,Canada,43,301,301,2017-12-09 21:07:10.481542,1
20,Yes,300,300,21a5d20d-1176-4b51-be59-927402aee277,1,300,4,medium,81,300,3,N,105,300,8,Real Estate,45,300,9,Mexico,40,300,300,2012-02-01 00:40:09.797146,1
21,No,337,337,99d03850-2774-46c9-b2a8-fabbde2e22e0,1,337,4,large,94,337,3,M,114,337,8,Political,52,337,9,SouthWest,43,337,337,2016-08-07 21:54:05.461732,1
21,Yes,310,310,01a7a1a1-1cdb-48cf-8587-633c2f4b23c7,1,310,4,large,87,310,3,F,110,310,8,Business,47,310,9,NorthWest,43,310,310,2001-05-20 17:36:21.722997,1
22,No,280,280,8bf04857-a636-405e-884a-25b8156265b9,1,280,4,medium,76,280,3,F,97,280,8,Travel,50,280,9,NorthEast,39,280,280,2018-11-02 15:35:07.250189,1
22,Yes,328,328,f557b1a1-cf6b-4859-85f1-ddede4de47d0,1,328,4,medium,96,328,3,M,116,328,8,Business,49,328,9,NorthWest,46,328,328,2013-07-25 12:57:03.814328,1


---

## Working with time...

### https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

In [312]:
import datetime

### There are three major ways you can deal with time in Pandas: 

1. Timestamps (DatetimeIndex, datetime64)
2. Periods / Timespans (PeriodIndex, period[freq]) 
3. Timedeltas (TimedeltaIndex, timedelta64)

In [313]:
# timestamps 

timestamps = pd.Timestamp(datetime.datetime(2019, 5, 9))
timestamps

Timestamp('2019-05-09 00:00:00')

In [314]:
# periods

periods = pd.Series(pd.period_range('1/1/2019', freq='M', periods=3))
periods

0    2019-01
1    2019-02
2    2019-03
dtype: period[M]

### It's important to note that both timestamps and periods can serve as an index. 

#### Remember, an index is a way to access a data point in a data frame. 

In [315]:
df.reset_index().head() # let's reset our index 

Unnamed: 0,index,id,company_size,age,sex,clicked_on_ad,ad_type,location,timestamp
0,0,081217b4-1cf5-4657-8287-6db1b75462e4,large,92,M,Yes,Business,MidWest,2018-08-26 06:00:27.124290
1,1,d0b45a01-b73d-4f8e-bfa8-c53ea75397f1,large,56,M,Yes,Culinary,SouthWest,2011-06-01 18:54:34.815634
2,2,1dc2e636-e19b-4d42-b228-df09cd009acb,large,20,F,No,Business,SouthEast,2013-07-16 00:24:47.888180
3,3,5d09d6d4-023e-4fa1-9559-89526679e885,large,55,F,Yes,Political,NorthWest,2010-06-25 12:13:51.369878
4,4,b69e54e3-fc89-4c0f-8bdb-280409db173e,medium,25,N,No,Tech,US,2010-09-22 07:53:12.454909


In [316]:
df.head()

Unnamed: 0,id,company_size,age,sex,clicked_on_ad,ad_type,location,timestamp
0,081217b4-1cf5-4657-8287-6db1b75462e4,large,92,M,Yes,Business,MidWest,2018-08-26 06:00:27.124290
1,d0b45a01-b73d-4f8e-bfa8-c53ea75397f1,large,56,M,Yes,Culinary,SouthWest,2011-06-01 18:54:34.815634
2,1dc2e636-e19b-4d42-b228-df09cd009acb,large,20,F,No,Business,SouthEast,2013-07-16 00:24:47.888180
3,5d09d6d4-023e-4fa1-9559-89526679e885,large,55,F,Yes,Political,NorthWest,2010-06-25 12:13:51.369878
4,b69e54e3-fc89-4c0f-8bdb-280409db173e,medium,25,N,No,Tech,US,2010-09-22 07:53:12.454909


In [317]:
df['timestamp'] = pd.to_datetime(df['timestamp'],format='%Y-%m-%d %H:%M:%S')

In [318]:
df.head()

Unnamed: 0,id,company_size,age,sex,clicked_on_ad,ad_type,location,timestamp
0,081217b4-1cf5-4657-8287-6db1b75462e4,large,92,M,Yes,Business,MidWest,2018-08-26 06:00:27.124290
1,d0b45a01-b73d-4f8e-bfa8-c53ea75397f1,large,56,M,Yes,Culinary,SouthWest,2011-06-01 18:54:34.815634
2,1dc2e636-e19b-4d42-b228-df09cd009acb,large,20,F,No,Business,SouthEast,2013-07-16 00:24:47.888180
3,5d09d6d4-023e-4fa1-9559-89526679e885,large,55,F,Yes,Political,NorthWest,2010-06-25 12:13:51.369878
4,b69e54e3-fc89-4c0f-8bdb-280409db173e,medium,25,N,No,Tech,US,2010-09-22 07:53:12.454909


In [319]:
type(df['timestamp'][0])

pandas._libs.tslibs.timestamps.Timestamp

In [320]:
df = df.set_index('timestamp')

In [321]:
df.head()

Unnamed: 0_level_0,id,company_size,age,sex,clicked_on_ad,ad_type,location
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2018-08-26 06:00:27.124290,081217b4-1cf5-4657-8287-6db1b75462e4,large,92,M,Yes,Business,MidWest
2011-06-01 18:54:34.815634,d0b45a01-b73d-4f8e-bfa8-c53ea75397f1,large,56,M,Yes,Culinary,SouthWest
2013-07-16 00:24:47.888180,1dc2e636-e19b-4d42-b228-df09cd009acb,large,20,F,No,Business,SouthEast
2010-06-25 12:13:51.369878,5d09d6d4-023e-4fa1-9559-89526679e885,large,55,F,Yes,Political,NorthWest
2010-09-22 07:53:12.454909,b69e54e3-fc89-4c0f-8bdb-280409db173e,medium,25,N,No,Tech,US


In [322]:
import numpy as np

periods = df.resample('M').count()
periods

Unnamed: 0_level_0,id,company_size,age,sex,clicked_on_ad,ad_type,location
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2000-01-31,206,206,206,206,206,206,206
2000-02-29,216,216,216,216,216,216,216
2000-03-31,211,211,211,211,211,211,211
2000-04-30,205,205,205,205,205,205,205
2000-05-31,206,206,206,206,206,206,206
2000-06-30,210,210,210,210,210,210,210
2000-07-31,234,234,234,234,234,234,234
2000-08-31,203,203,203,203,203,203,203
2000-09-30,204,204,204,204,204,204,204
2000-10-31,193,193,193,193,193,193,193


In [327]:
start_date = '2000-10-31'
end_date = '2001-06-30'

mask = (df.index > start_date) & (df.index <= date)

In [330]:
df = df.loc[mask]
df

Unnamed: 0_level_0,id,company_size,age,sex,clicked_on_ad,ad_type,location
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2011-06-01 18:54:34.815634,d0b45a01-b73d-4f8e-bfa8-c53ea75397f1,large,56,M,Yes,Culinary,SouthWest
2013-07-16 00:24:47.888180,1dc2e636-e19b-4d42-b228-df09cd009acb,large,20,F,No,Business,SouthEast
2010-06-25 12:13:51.369878,5d09d6d4-023e-4fa1-9559-89526679e885,large,55,F,Yes,Political,NorthWest
2010-09-22 07:53:12.454909,b69e54e3-fc89-4c0f-8bdb-280409db173e,medium,25,N,No,Tech,US
2003-05-31 07:24:51.873042,8b449b66-55fb-411b-a27d-81f88d12d6e2,startup,39,N,No,Real Estate,SouthEast
2008-10-10 16:28:52.002102,eedba0f5-0e4b-49b2-8409-62b24761a05a,large,76,F,No,Luxury,SouthEast
2002-09-15 07:49:05.836594,287cf905-00d3-4347-a5f5-e82876f7e309,medium,58,F,No,Culinary,SouthAmerica
2005-11-22 03:13:11.646444,91f3fc7f-42ec-4d5a-85c9-4e8292fef5ad,large,71,N,No,Travel,US
2009-02-17 07:44:36.928736,e013c5de-837c-450e-826b-94379acd9a08,startup,53,F,Yes,Luxury,US
2010-11-06 04:46:03.849307,fc9f29d9-6042-44c6-b996-04caefa6e2cc,large,39,M,No,Real Estate,NorthWest
