# All material ©2019, Alex Siegman

---

# Data Analysis with Python and Pandas 

In [1]:
import pandas as pd # importing the Pandas library

#### For a full list of all the possible Pandas operations:  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

## Data Upload

### The first step of data analysis is to actually get your data in the right place. 

### In order to upload our CSV (Comma-Separated Values) file into our Jupyter Notebook, we need to point our machine to the right folder, so to speak. 

### We can use Command Line commands to do so. (For more on the Command Line, check out "Unix 101" in the GitHub repo). 

In [2]:
!pwd # AKA, 'Print Working Directory' – tells us what folder I am in right now.

# think of this like using your mouse to click into and out of folders on your desktop. 
# this is just bypassing that UI.

# the '!' allows you to execute a shell command. Basically, you're working as if you would in 
# your terminal, but from the Jupyter Notebook.

/Users/siegmanA/Desktop/NYU-Projects-in-Programming-Fall-2019/(Class 2) Python and Pandas


In [3]:
ls # list all of the files in the current directory (remember, directory = folder in UI world)

Python and Pandas Solved.ipynb  SternTech_UserData.csv


### Now that I'm in the right place (I see the CSV is in this folder), I can 'read' my CSV using the following command:

In [4]:
df = pd.read_csv('./SternTech_UserData.csv',encoding='utf-8') # read in the csv

# we are assigning our dataset to a variable named 'df'.
# we can name this anything at all, it doesn't matter.
# df is commonplace, though, and stands for 'data frame'.

# you can ignore the 'encoding' piece for now, we'll get to that later on when we talk about web scraping. 

### Let's begin with a preliminary, exploratory analysis of our data...

In [5]:
pd.options.display.max_rows = 2000 # the way Jupyter Notebook tends to display the results of such queries isn't 
                                   # always helpful, but we can very easily change that.
                                   # this will ensure we can view up to 2,000 rows without seeing ellipses in the UI
    
pd.options.display.max_columns = 50 # try commenting out this last line ('max_columns =50') then run the cell below
                                    # to see the difference this formatting makes 
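
The two assignments above change the display settings for the rest of the session. If you only want a wider view for one cell, a temporary override is an alternative. The sketch below assumes a small made-up frame named `demo` (not part of the class dataset) and uses `pd.option_context`, which restores the previous setting when the `with` block ends.

```python
import pandas as pd

# Hypothetical frame: 100 rows, more than we will allow the display to show.
demo = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# Inside the `with` block, at most 10 rows are printed before pandas truncates.
with pd.option_context("display.max_rows", 10):
    inside = pd.get_option("display.max_rows")
    truncated = repr(demo)

# Once the block ends, the earlier setting is back in effect.
after = pd.get_option("display.max_rows")
```

`pd.reset_option("display.max_rows")` is another way to go back to the pandas default after setting an option globally.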

In [6]:
df.head() # this gets the first five rows of data in your data frame 
          # df.tail() will give you the last five rows
          # if you want, you can choose any number - df.head(15) would give you the first 15 rows, for instance

Unnamed: 0.1,Unnamed: 0,id,company_size,age,sex,clicked_on_ad,ad_type,location,timestamp
0,0,081217b4-1cf5-4657-8287-6db1b75462e4,large,92,M,Yes,Business,MidWest,2018-08-26 06:00:27.124290
1,1,d0b45a01-b73d-4f8e-bfa8-c53ea75397f1,large,56,M,Yes,Culinary,SouthWest,2011-06-01 18:54:34.815634
2,2,1dc2e636-e19b-4d42-b228-df09cd009acb,large,20,F,No,Business,SouthEast,2013-07-16 00:24:47.888180
3,3,5d09d6d4-023e-4fa1-9559-89526679e885,large,55,F,Yes,Political,NorthWest,2010-06-25 12:13:51.369878
4,4,b69e54e3-fc89-4c0f-8bdb-280409db173e,medium,25,N,No,Tech,US,2010-09-22 07:53:12.454909


In [7]:
list(df) # get a list of all the column names for your data frame

# we'll discuss this later on, but note that a list consists of comma-separated values inside of square brackets
# you can also use "df.columns" if you prefer, which will give you a similar output

['Unnamed: 0',
 'id',
 'company_size',
 'age',
 'sex',
 'clicked_on_ad',
 'ad_type',
 'location',
 'timestamp']
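
As a small illustration of the difference between `list(df)` and `df.columns`, the sketch below uses a made-up one-row frame called `people` (an assumption for the example, not the class dataset). The first gives a plain Python list, the second a pandas `Index` object holding the same names.

```python
import pandas as pd

# Hypothetical one-row frame, just to compare the two approaches.
people = pd.DataFrame({"id": ["a1"], "age": [92]})

as_list = list(people)       # plain Python list of column names
as_index = people.columns    # pandas Index holding the same names

# Converting the Index to a list gives the same result as list(people).
same = as_list == as_index.tolist()
```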

In [8]:
# let's drop that "unnamed" column 

df = df.drop(df.columns[[0]],axis=1)
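
A column named 'Unnamed: 0' usually appears when a CSV was saved with its row index and the header for that column was left blank. Whether that is what happened to our file is an assumption. The sketch below uses a made-up two-row CSV held in memory (`csv_text` is hypothetical) to show two other ways of handling it: dropping the column by name, or telling `read_csv` to use the first column as the index.

```python
import io
import pandas as pd

# Hypothetical CSV text with a blank first header, as produced when a
# data frame is saved together with its index.
csv_text = ",id,age\n0,a1,92\n1,b2,56\n"

# Default read: the blank header becomes a column called 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: drop the column by name instead of by position.
dropped = raw.drop(columns=["Unnamed: 0"])

# Option 2: treat the first column as the row index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Dropping by name is a little safer than dropping by position, because it raises an error if the column is not there rather than silently removing a different one.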

In [9]:
list(df)

['id',
 'company_size',
 'age',
 'sex',
 'clicked_on_ad',
 'ad_type',
 'location',
 'timestamp']

In [10]:
df.describe() # get the basic statistical metrics for the numeric columns of a data frame

Unnamed: 0,age
count,50000.0
mean,58.4073
std,23.679151
min,18.0
25%,38.0
50%,58.0
75%,79.0
max,99.0
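
Only 'age' shows up above because `describe()` summarizes numeric columns by default. As a hedged sketch on a made-up four-row frame (`people` is an assumption for illustration), calling `describe()` on text columns alone returns a different summary: count, number of unique values, most frequent value, and its frequency.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
people = pd.DataFrame({"age": [92, 56, 20, 55], "sex": ["M", "M", "F", "F"]})

# Default behaviour: only the numeric column is summarized.
numeric_summary = people.describe()

# Selecting only text columns gives count / unique / top / freq instead.
text_summary = people[["sex"]].describe()
```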


In [11]:
df.count() # get a count of the non-NA cells for each column

id               50000
company_size     50000
age              50000
sex              50000
clicked_on_ad    50000
ad_type          50000
location         50000
timestamp        50000
dtype: int64

In [12]:
df['sex'].value_counts() # count how many rows hold each distinct value in a column (NA values are excluded)

N    16812
M    16608
F    16580
Name: sex, dtype: int64
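
Two optional arguments of `value_counts()` are often handy here. The sketch below uses a short made-up Series (`sex` with one missing value, an assumption for the example) to show proportions via `normalize=True` and how to include missing values via `dropna=False`.

```python
import pandas as pd

# Hypothetical column with one missing value.
sex = pd.Series(["N", "M", "F", "N", None], name="sex")

counts = sex.value_counts()                # raw counts, NA excluded
shares = sex.value_counts(normalize=True)  # fractions of the non-NA rows
with_na = sex.value_counts(dropna=False)   # counts with the NA row included
```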

In [13]:
df.info() # just some basic information on the data types (strings, integers, floats, etc.) for each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
id               50000 non-null object
company_size     50000 non-null object
age              50000 non-null int64
sex              50000 non-null object
clicked_on_ad    50000 non-null object
ad_type          50000 non-null object
location         50000 non-null object
timestamp        50000 non-null object
dtypes: int64(1), object(7)
memory usage: 3.1+ MB


### It's important to note that our timestamp values are being stored with the generic 'object' type (plain strings) and not as timestamps, as we'd like. So, let's change that: 

In [14]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
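
To see what the conversion buys us, here is a minimal sketch on a made-up two-row frame (`events` is an assumption; the two timestamp strings are copied from the first rows shown earlier). Once the column holds real datetimes, the `.dt` accessor exposes parts such as the year.

```python
import pandas as pd

# Hypothetical frame whose timestamps start out as plain strings.
events = pd.DataFrame(
    {"timestamp": ["2018-08-26 06:00:27.124290", "2011-06-01 18:54:34.815634"]}
)

# Convert the strings to datetime values.
events["timestamp"] = pd.to_datetime(events["timestamp"])

# Confirm the new type and pull out a date component.
is_datetime = pd.api.types.is_datetime64_any_dtype(events["timestamp"])
years = events["timestamp"].dt.year.tolist()
```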

In [15]:
df.head()

Unnamed: 0,id,company_size,age,sex,clicked_on_ad,ad_type,location,timestamp
0,081217b4-1cf5-4657-8287-6db1b75462e4,large,92,M,Yes,Business,MidWest,2018-08-26 06:00:27.124290
1,d0b45a01-b73d-4f8e-bfa8-c53ea75397f1,large,56,M,Yes,Culinary,SouthWest,2011-06-01 18:54:34.815634
2,1dc2e636-e19b-4d42-b228-df09cd009acb,large,20,F,No,Business,SouthEast,2013-07-16 00:24:47.888180
3,5d09d6d4-023e-4fa1-9559-89526679e885,large,55,F,Yes,Political,NorthWest,2010-06-25 12:13:51.369878
4,b69e54e3-fc89-4c0f-8bdb-280409db173e,medium,25,N,No,Tech,US,2010-09-22 07:53:12.454909


### A bit more preliminary exploratory analysis:

In [16]:
df.sample() # get one randomly chosen row from the data frame

Unnamed: 0,id,company_size,age,sex,clicked_on_ad,ad_type,location,timestamp
7713,01ccce31-0180-418a-9c7a-ad350f5258af,small,20,N,Yes,Luxury,NorthWest,2017-02-05 19:41:31.729420
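
`sample()` returns a different row each time it runs. If you want several rows, or want the same rows every time you rerun the notebook, the `n` and `random_state` arguments help. A small sketch, assuming a made-up five-row frame named `people` and an arbitrary seed of 42:

```python
import pandas as pd

# Hypothetical five-row frame.
people = pd.DataFrame({"age": [92, 56, 20, 55, 25]})

# Draw three distinct rows; the fixed seed makes the draw repeatable.
three = people.sample(n=3, random_state=42)
again = people.sample(n=3, random_state=42)
```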


In [17]:
df['age'] # select a single column

0        92
1        56
2        20
3        55
4        25
5        87
6        39
7        39
8        62
9        76
10       58
11       38
12       71
13       53
14       25
15       39
16       95
17       42
18       37
19       38
20       30
21       56
22       56
23       80
24       43
25       42
26       85
27       71
28       27
29       82
30       89
31       45
32       27
33       55
34       27
35       51
36       66
37       98
38       18
39       44
40       80
41       37
42       28
43       77
44       93
45       69
46       36
47       42
48       60
49       97
50       34
51       46
52       65
53       85
54       38
55       23
56       83
57       54
58       75
59       51
60       44
61       92
62       87
63       47
64       75
65       64
66       60
67       54
68       20
69       66
70       26
71       37
72       63
73       95
74       57
75       90
76       72
77       20
78       43
79       33
80       25
81       65
82       71
83  

In [18]:
df.loc[:, ['age','sex']] # select multiple columns

# .loc is used for labels/names

Unnamed: 0,age,sex
0,92,M
1,56,M
2,20,F
3,55,F
4,25,N
5,87,N
6,39,N
7,39,N
8,62,M
9,76,F


In [19]:
df.iloc[3] # get information on a single row 

# .iloc is used for position numbers

id               5d09d6d4-023e-4fa1-9559-89526679e885
company_size                                    large
age                                                55
sex                                                 F
clicked_on_ad                                     Yes
ad_type                                     Political
location                                    NorthWest
timestamp                  2010-06-25 12:13:51.369878
Name: 3, dtype: object

In [20]:
df.iloc[3,6] # get the value of the 7th column (location) for the 4th row (index 3)

'NorthWest'
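
In our data frame the row labels happen to match the row positions (0, 1, 2, ...), which can hide the difference between `.loc` and `.iloc`. The sketch below uses a made-up frame (`people`, with row labels 10, 20, 30 chosen purely for illustration) where the two no longer coincide.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions.
people = pd.DataFrame(
    {"age": [92, 56, 20], "sex": ["M", "M", "F"]},
    index=[10, 20, 30],
)

by_label = people.loc[20, "age"]   # row labelled 20, column named 'age'
by_position = people.iloc[1, 0]    # second row, first column (same cell)

# Look up a column's position by name instead of counting by hand.
sex_position = people.columns.get_loc("sex")
last_sex = people.iloc[2, sex_position]
```

Looking up the position with `get_loc` avoids the easy mistake of miscounting columns when using `.iloc`.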

In [21]:
df['age'].mean() # get the mean of a column

58.4073

In [22]:
df.sort_values(by="age",ascending=False) # sort by age, oldest first (this returns a new data frame; df itself is unchanged)

Unnamed: 0,id,company_size,age,sex,clicked_on_ad,ad_type,location,timestamp
25826,b69896fc-daca-4d80-8236-865a5a409cb0,small,99,M,No,Fashion,SouthWest,2005-02-13 13:56:07.397405
18815,be92ef02-1ef8-4584-be9d-9a8b6689459c,small,99,F,Yes,Tech,Canada,2016-09-21 16:11:50.024858
16099,2eeda9ae-4106-414d-ba6a-cc6caeb08055,startup,99,M,Yes,Political,NorthEast,2012-10-17 15:52:56.060877
25483,34ef06b8-5698-450d-851d-9d8ad880ab96,medium,99,N,No,Travel,NorthWest,2000-04-24 11:13:49.047057
46640,23f5db97-4071-4f10-9c58-ea437657d0a0,small,99,N,No,Tech,NorthEast,2011-02-22 17:35:44.331160
9580,15c5c9b0-21f3-4b88-bd76-07c84377175b,medium,99,N,No,Fashion,SouthAmerica,2006-11-22 08:45:57.943829
31506,f00eed26-6a74-4d8d-9024-2c330b9bb224,small,99,N,Yes,Tech,Canada,2009-04-20 21:34:49.240226
25469,f255e608-220a-470b-a1ad-2bf264601ac7,large,99,F,Yes,Business,Canada,2016-10-27 17:35:26.355933
21199,34d397cd-9f61-47c7-89e0-3d5f7ebfb34a,medium,99,N,Yes,Business,Canada,2000-02-25 19:55:25.571699
38868,6ef74d4c-aae6-4d9c-81fe-26c982659942,small,99,F,No,Luxury,US,2012-07-25 07:53:44.226284
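
With so many rows tied at age 99, a second sort key can make the order more meaningful. A minimal sketch on a made-up four-row frame (`people` is an assumption): sort by age descending, then by sex ascending within each age.

```python
import pandas as pd

# Hypothetical frame with a tie on age.
people = pd.DataFrame({"age": [55, 99, 20, 99], "sex": ["F", "M", "F", "F"]})

# Oldest first; ties on age are ordered alphabetically by sex.
ordered = people.sort_values(by=["age", "sex"], ascending=[False, True])

# The original frame keeps its order because sort_values returns a copy.
original_order = people.index.tolist()
```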


In [23]:
df[df['age'] < 21] # see any rows where age < 21

Unnamed: 0,id,company_size,age,sex,clicked_on_ad,ad_type,location,timestamp
2,1dc2e636-e19b-4d42-b228-df09cd009acb,large,20,F,No,Business,SouthEast,2013-07-16 00:24:47.888180
38,2f695228-930a-4be3-ad3d-d7bcd9eec04c,medium,18,N,No,Fashion,NorthWest,2000-08-25 19:23:09.883773
68,4231063b-8972-4bfe-b9bf-6c276a7ab16a,large,20,M,No,Political,US,2013-06-13 11:49:01.853099
77,ed1fb847-e8a4-422c-b8cd-2a045f3bb8bb,startup,20,F,Yes,Fashion,Mexico,2015-06-30 00:06:07.048975
169,8b4fe56f-4547-4435-a960-80134238127f,small,19,F,Yes,Luxury,NorthWest,2015-11-19 05:03:19.210423
171,bef0e91f-ac50-4d05-bc54-4d19fa39aa5f,startup,20,M,No,Business,SouthAmerica,2000-03-04 15:29:28.600814
213,835de21d-a45e-415d-a6c7-021878fed9cf,large,19,N,Yes,Culinary,SouthAmerica,2007-03-17 06:49:02.737494
229,4451a98f-369a-4206-8fdf-81387fa62621,large,20,F,Yes,Culinary,SouthEast,2007-10-23 21:59:06.438277
264,1759bc2f-8e18-42f4-bdc5-defca827c5c5,medium,20,N,No,Culinary,Mexico,2016-03-12 04:10:14.879635
281,26f4eed2-bbfb-4d67-959a-85a9079773f0,large,19,F,Yes,Tech,NorthWest,2005-06-15 22:06:48.899107
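
The same filtering pattern extends to more than one condition. The sketch below assumes a made-up four-row frame named `people`; each condition goes in its own parentheses and the two are joined with `&` (and) or `|` (or).

```python
import pandas as pd

# Hypothetical frame with the two columns we want to filter on.
people = pd.DataFrame(
    {"age": [20, 18, 55, 19], "clicked_on_ad": ["No", "No", "Yes", "Yes"]}
)

# True only where the person is under 21 AND clicked on the ad.
mask = (people["age"] < 21) & (people["clicked_on_ad"] == "Yes")

young_clickers = people[mask]
```

The parentheses matter: `&` binds more tightly than `<` and `==` in Python, so leaving them out raises an error or gives the wrong result.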


---

## That's a lot, I know. Next class we will continue using Pandas and some other Python libraries to delve further into the world of descriptive analytics. 

## For now, take some time to review this notebook and keep practicing!