# All material ©2019, Alex Siegman

---

# Data Analysis with Python and Pandas 

In [None]:
import pandas as pd # importing the Pandas library

#### For a full list of all the possible Pandas operations:  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

## Data Upload

### The first step of data analysis is to actually get your data in the right place. 

### In order to upload our CSV (Comma-Separated Values) file into our Jupyter Notebook, we need to point our machine to the right folder, so to speak.

### We can use Command Line commands to do so. (For more on the Command Line, check out "Unix 101" in the GitHub repo). 

In [None]:
!pwd # AKA, 'Print Working Directory' – tells us what folder I am in right now.

# think of this like using your mouse to click into and out of folders on your desktop. 
# this is just bypassing that UI.

# the '!' allows you to execute a shell command. Basically, you're working as if you would in 
# your terminal, but from the Jupyter Notebook.

In [None]:
ls # list all of the files in the current directory (remember, directory = folder in UI world)

### Now that I'm in the right place (I see the CSV is in this folder), I can 'read' my CSV using the following command:

In [None]:
df = pd.read_csv('./SternTech_UserData.csv',encoding='utf-8') # read in the csv

# we are assigning our dataset to a variable named 'df'.
# we can name this variable anything we like; the name itself doesn't matter.
# 'df' is the common convention, though, and stands for 'data frame'.

# you can ignore the 'encoding' piece for now, we'll get to that later on when we talk about web scraping. 
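
### If you'd rather check the folder from Python itself, here is a minimal sketch using the standard library's pathlib. The name `df_check` is just an illustrative placeholder, and the sketch assumes the CSV sits in the current working directory, as above.

```python
from pathlib import Path

import pandas as pd

# same relative path as in the read_csv call above
csv_path = Path("./SternTech_UserData.csv")

if csv_path.exists():
    # the file is where we expect it, so read it in
    df_check = pd.read_csv(csv_path, encoding="utf-8")
    print(df_check.shape)  # (number of rows, number of columns)
else:
    # the Python equivalent of running !pwd to see where we are
    print(f"{csv_path.name} not found in {Path.cwd()}")
```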

### Let's begin with a preliminary, exploratory analysis of our data...

In [None]:
pd.options.display.max_rows = 2000 # the way Jupyter Notebook tends to display the results of such queries isn't 
                                   # always helpful, but we can very easily change that.
                                   # this will ensure we can view up to 2,000 rows without seeing ellipses in the UI
    
pd.options.display.max_columns = 50 # try commenting out this last line ('max_columns =50') then run the cell below
                                    # to see the difference this formatting makes 
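
### The settings above stay in place for the whole notebook. As a rough sketch, `pd.option_context` changes a display option only temporarily. The `wide` data frame here is made up purely for illustration.

```python
import pandas as pd

# made-up one-row data frame with 30 columns, just to have something wide
wide = pd.DataFrame({f"col_{i}": [i] for i in range(30)})

before = pd.get_option("display.max_columns")

# inside the 'with' block, at most 5 columns are displayed
with pd.option_context("display.max_columns", 5):
    inside = pd.get_option("display.max_columns")
    print(wide)

# once the block ends, the previous setting is restored automatically
after = pd.get_option("display.max_columns")
```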

In [None]:
df.head() # this gets the first five rows of data in your data frame 
          # df.tail() will give you the last five rows
          # if you want, you can choose any number - df.head(15) would give you the first 15 rows, for instance

In [None]:
list(df) # get a list of all the column names for your data frame

# we'll discuss this later on, but note that a list is comprised of comma-separated values inside of square brackets
# you can also use "df.columns" if you prefer, which will give you a similar output

In [None]:
# let's drop that "unnamed" column 

df = df.drop(df.columns[[0]],axis=1)

In [None]:
list(df)
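
### Where might an "unnamed" column come from? One common cause (an assumption here, since we don't know how this CSV was made) is a data frame saved with its row index. A minimal sketch with made-up data:

```python
import io

import pandas as pd

# made-up two-row data frame standing in for the real data
original = pd.DataFrame({"age": [34, 19], "sex": ["F", "M"]})

# to_csv writes the row index as a first column with an empty header
csv_text = original.to_csv()

# read_csv then labels that header-less column 'Unnamed: 0'
reloaded = pd.read_csv(io.StringIO(csv_text))
print(list(reloaded))

# option 1: drop the first column by position, as in the cell above
dropped = reloaded.drop(reloaded.columns[[0]], axis=1)

# option 2: tell read_csv to treat the first column as the index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
```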

In [None]:
df.describe() # get the basic statistical metrics (count, mean, min, max, et cetera) for the numeric columns

In [None]:
df.count() # get a count of the non-NA cells for each column

In [None]:
df['sex'].value_counts() # count how many times each distinct value appears in a column (NA cells are left out)
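
### A small sketch of two optional `value_counts` arguments, `dropna` and `normalize`, using a made-up column rather than the real data:

```python
import pandas as pd

# made-up column with one missing value
sex = pd.Series(["F", "M", "F", None, "F"])

counts = sex.value_counts()                # NA cells are excluded by default
with_na = sex.value_counts(dropna=False)   # NA cells get their own row
shares = sex.value_counts(normalize=True)  # proportions instead of raw counts

print(counts)
print(with_na)
print(shares)
```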

In [None]:
df.info() # just some basic information on the data types (strings, integers, floats, et cetera) for each column

### It's important to note that our timestamp values are listed as 'non-null object' (that is, plain text) and not as timestamps, as we'd like. So, let's change that:

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
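
### Why bother converting? Once a column holds real timestamps, the `.dt` accessor can pull out pieces like the hour or the weekday. A minimal sketch on made-up timestamps (the `events` data frame and its values are illustrative, not from our CSV):

```python
import pandas as pd

# made-up timestamps stored as text, the way they arrive from a CSV
events = pd.DataFrame({"timestamp": ["2019-03-01 09:15:00", "2019-03-02 18:40:00"]})

# convert the text column into real timestamps
events["timestamp"] = pd.to_datetime(events["timestamp"])
print(events["timestamp"].dtype)

# the .dt accessor only works on datetime columns
events["hour"] = events["timestamp"].dt.hour
events["weekday"] = events["timestamp"].dt.day_name()
print(events)
```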

In [None]:
df.head()

### A bit more preliminary exploratory analysis:

In [None]:
df.sample() # get one random row from the data frame (df.sample(5) would give you five random rows)

In [None]:
df['age'] # select a single column

In [None]:
df.loc[:, ['age','sex']] # select multiple columns

# .loc is used for labels/names

In [None]:
df.iloc[3] # get information on a single row 

# .iloc is used for position numbers

In [None]:
df.iloc[3,6] # get the value of the 7th column (ad_type) for the 4th row (positions start at 0, so row 3 and column 6)
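
### The difference between .loc and .iloc is easiest to see when the row labels are not 0, 1, 2... Here is a minimal sketch with a made-up `users` data frame whose row labels (101, 102, 103) are chosen just for illustration:

```python
import pandas as pd

# made-up data frame with row labels that differ from the row positions
users = pd.DataFrame(
    {"age": [34, 19, 52], "sex": ["F", "M", "F"]},
    index=[101, 102, 103],
)

by_label = users.loc[102, "age"]  # row labelled 102, column named 'age'
by_position = users.iloc[1, 0]    # second row, first column (same cell)

# label slices include the end label; position slices stop before the end
label_slice = users.loc[101:102]   # rows 101 and 102
position_slice = users.iloc[0:1]   # only the first row
```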

In [None]:
df['age'].mean() # get the mean of a column

In [None]:
df.sort_values(by="age",ascending=False) # sort by age, from oldest to youngest
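
### Note that sort_values hands back a new, sorted data frame and leaves the original as it was. A quick sketch on made-up data:

```python
import pandas as pd

# made-up data frame for illustration
users = pd.DataFrame({"age": [34, 19, 52]})

# the sorted result has to be stored somewhere if we want to keep it
sorted_users = users.sort_values(by="age", ascending=False)

print(list(sorted_users["age"]))  # sorted copy
print(list(users["age"]))         # original order is unchanged
```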

In [None]:
df[df['age'] < 21] # see any rows where age < 21
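
### Under the hood, the expression inside the brackets is a column of True/False values, and conditions can be combined with & (and) or | (or). A minimal sketch on a made-up `users` data frame:

```python
import pandas as pd

# made-up data frame for illustration
users = pd.DataFrame({"age": [34, 19, 52, 20], "sex": ["F", "M", "F", "F"]})

# the comparison alone gives one True/False value per row
mask = users["age"] < 21

# passing that mask back to the data frame keeps only the True rows
under_21 = users[mask]

# each condition needs its own parentheses when combining with & or |
young_women = users[(users["age"] < 21) & (users["sex"] == "F")]
```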

---

## That's a lot, I know. Next class we will continue using Pandas and some other Python libraries to delve further into the world of descriptive analytics. 

## For now, take some time to review this notebook and keep practicing!