# All material ©2019, Alex Siegman

---

# Descriptive Analysis in Python

In [None]:
import pandas as pd # importing the Pandas library

#####  For a full list of all the possible Pandas operations:  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

In [None]:
!pwd # AKA 'Print Working Directory': tells us which folder we are in right now.

# think of this like using your mouse to click into and out of folders on your desktop. 
# this is just bypassing that UI.

# the '!' allows you to execute a shell command. Basically, you're working as if you would in 
# your terminal, but from the Jupyter Notebook.

In [None]:
ls # list all of the files in the current directory (remember, directory = folder in UI world)

In [None]:
df = pd.read_csv('./SternTech_UserData.csv',encoding='utf-8') # read in the csv

# we are assigning our dataset to the variable 'df'.
# we can name this anything at all, it doesn't matter.
# df is commonplace, though, and stands for 'data frame'.

# you can ignore the 'encoding' piece for now, we'll get to that later on when we talk about web scraping. 

In [None]:
pd.options.display.max_rows = 2000 # the way Jupyter Notebook tends to display the results of such queries isn't 
                                   # always helpful, but we can very easily change that.
                                   # this will ensure we can view up to 2,000 rows without seeing ellipses in the UI
    
pd.options.display.max_columns = 50 # try commenting out this last line ('max_columns =50') then run the cell below
                                    # to see the difference this formatting makes 

In [None]:
# let's drop that "unnamed" column (the first column, at position 0)

df = df.drop(df.columns[[0]],axis=1)

In [None]:
df.head() # this gets the first five rows of data in your data frame 
          # df.tail() will give you the last five rows
          # if you want, you can choose any number - df.head(15) would give you the first 15 rows, for instance

In [None]:
list(df) # get a list of all the column names for your data frame

# we'll discuss this later on, but note that a list is made up of comma-separated values inside of square brackets
# you can also use "df.columns" if you prefer, which will give you a similar output

In [None]:
df.describe() # get the basic statistical metrics for a data frame

## Group By

In [None]:
age = 
age.describe()

In [None]:
age = 
age.describe()
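
The cells above are left blank for the exercise. As a rough sketch of what a group-by looks like, assuming a small made-up data frame (the `plan` and `age` columns here are hypothetical and are not taken from `SternTech_UserData.csv`):

```python
import pandas as pd

# Hypothetical data: these column names are illustrative assumptions,
# not the real columns of the course dataset.
users = pd.DataFrame({
    "plan": ["free", "paid", "free", "paid", "free"],
    "age": [22, 35, 28, 41, 31],
})

# Split the rows by 'plan', then select the 'age' column within each group.
age_by_plan = users.groupby("plan")["age"]

# describe() now returns one row of summary statistics per group.
summary = age_by_plan.describe()

# A single aggregate works the same way.
mean_age = age_by_plan.mean()
```

Swapping `"plan"` for any categorical column in your own data frame should give the same kind of per-group summary.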

---

## Working With Time...

#### https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

In [None]:
import datetime # import the datetime library 

### There are three major ways you can deal with time in Pandas: 

1. Timestamps (DatetimeIndex, datetime64)
2. Periods / Timespans (PeriodIndex, period[freq]) 
3. Timedeltas (TimedeltaIndex, timedelta64)

In [None]:
# timestamps 

timestamps = pd.Timestamp(datetime.datetime(2019, 5, 9))
timestamps

In [None]:
# periods

periods = pd.Series(pd.period_range('1/1/2019', freq='M', periods=3))
periods
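
The third item in the list above, timedeltas, represents a duration rather than a point in time. A minimal sketch, using dates chosen only for illustration:

```python
import pandas as pd

start = pd.Timestamp("2019-05-09")

# A Timedelta is a fixed duration that can be added to a Timestamp.
one_week = pd.Timedelta(days=7)
later = start + one_week

# Subtracting two Timestamps gives a Timedelta back.
elapsed = pd.Timestamp("2019-06-01") - start
```

Here `later` is May 16, 2019, and `elapsed.days` is 23.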

### It's important to note that both timestamps and periods can serve as an index. 

#### Remember, an index is a way to access a data point in a data frame. 

In [None]:
 # let's reset our index 

In [None]:
 # check the first five rows of our dataset 

In [None]:
 # convert to datetime 

In [None]:
 # make sure we've properly converted from str to timestamp

In [None]:
 # set our index to the timestamp column

In [None]:
 # check the first five rows of our dataframe 
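
The steps in the blank cells above (convert a string column to timestamps, confirm the conversion, then use that column as the index) might look roughly like this. The `events` data frame and its `signup_date` and `purchases` columns are made up for illustration and are not the course dataset:

```python
import pandas as pd

# Hypothetical data: dates start out as plain strings.
events = pd.DataFrame({
    "signup_date": ["2019-01-15", "2019-02-03", "2019-02-20"],
    "purchases": [1, 4, 2],
})

# Convert the string column to timestamps.
events["signup_date"] = pd.to_datetime(events["signup_date"])

# Confirm the conversion: True once the column holds datetimes, not strings.
is_datetime = pd.api.types.is_datetime64_any_dtype(events["signup_date"])

# Use the timestamp column as the index.
events = events.set_index("signup_date")

# With a DatetimeIndex, .loc accepts a partial date string such as a month.
february = events.loc["2019-02"]
```

Calling `events.reset_index()` would move `signup_date` back into a regular column.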

In [None]:
import numpy as np

periods =  # let's group by month 
periods
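
One possible way to group by month, not necessarily the approach intended for the blank above, is to convert a timestamp index to monthly periods and group on those. The `events` data below is hypothetical:

```python
import pandas as pd

# Hypothetical data indexed by timestamp.
events = pd.DataFrame(
    {"purchases": [1, 4, 2, 3]},
    index=pd.to_datetime(
        ["2019-01-15", "2019-02-03", "2019-02-20", "2019-03-08"]
    ),
)

# to_period("M") maps every timestamp to the calendar month it falls in.
months = events.index.to_period("M")

# Group on those monthly periods and total the purchases in each one.
monthly = events.groupby(months)["purchases"].sum()
```

The result has one row per month, indexed by a `PeriodIndex`.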

In [None]:
start_date =  # set an arbitrary start date
end_date =  # set an arbitrary end date

mask =  # create our mask

In [None]:
df = 
df # check our mask results
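
A mask is a boolean Series with one True/False value per row, and passing it to `.loc` keeps only the True rows. A sketch of a date-range mask, assuming made-up data and arbitrary dates:

```python
import pandas as pd

# Hypothetical data with a timestamp column.
events = pd.DataFrame({
    "signup_date": pd.to_datetime(
        ["2019-01-15", "2019-02-03", "2019-02-20", "2019-03-08"]
    ),
    "purchases": [1, 4, 2, 3],
})

# Arbitrary start and end dates, chosen only for this example.
start_date = pd.Timestamp("2019-02-01")
end_date = pd.Timestamp("2019-02-28")

# Each comparison yields a boolean Series; & combines them row by row.
# The parentheses are required because & binds tighter than >= and <=.
mask = (events["signup_date"] >= start_date) & (events["signup_date"] <= end_date)

# Keep only the rows where the mask is True.
in_range = events.loc[mask]
```

Only the two February rows survive the filter.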

## Dummy Variables:

Oftentimes, you will have Yes / No values, or other categorical variables, that you need to transform into numerical data. The easiest way to do that is with dummy variables.

In [None]:
df = 

In [None]:
df.head()
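
As a sketch of the idea, assuming a made-up data frame (the `subscribed` and `device` columns are hypothetical): a Yes / No column can be mapped straight to 1 / 0, while a column with several categories can be expanded with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical data: one Yes/No column and one multi-category column.
users = pd.DataFrame({
    "subscribed": ["Yes", "No", "Yes"],
    "device": ["ios", "android", "web"],
})

# A two-valued column can be mapped directly to 1 and 0.
users["subscribed"] = users["subscribed"].map({"Yes": 1, "No": 0})

# get_dummies replaces 'device' with one 0/1 column per category.
# dtype=int asks for integers rather than booleans.
users = pd.get_dummies(users, columns=["device"], dtype=int)
```

The resulting columns are `subscribed`, `device_android`, `device_ios`, and `device_web`.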

## Last but not least, a bit of Data Viz:

_P.S. We'll delve much further into data viz later on in this course, so don't worry if this appears too basic_

In [None]:
df.plot() # note that it's only going to plot the numerical data, and it isn't very helpful...

In [None]:
df = 

In [None]:
df.head()

In [None]:
df.plot() # not very helpful, but easy!

---