# Data analysis with pandas

In [1]:
import pandas as pd

In [3]:
# Importing 'Populations' data

Populations = 'https://s3.amazonaws.com/dss-fall2018/Populations.csv'

df = pd.read_csv(Populations)
df.head(10)

Unnamed: 0,Zip Code,Total Population,Total Males,Total Females,Total Households
0,90001,57110,28468,28642,12971
1,90002,51223,24876,26347,11731
2,90003,66266,32631,33635,15642
3,90004,62180,31302,30878,22547
4,90005,37681,19299,18382,15044
5,90006,59185,30254,28931,18617
6,90007,40920,20915,20005,11944
7,90008,32327,14477,17850,13841
8,90010,3800,1874,1926,2014
9,90011,103892,52794,51098,22168


In [4]:
df.shape

(318, 5)

We have 318 rows and 5 columns in the dataset.

### Replicating ```df.head()``` with the ```iloc``` indexer

In [5]:
# Replicating df.head(10) with the position-based iloc indexer

df.iloc[0:10,:]

Unnamed: 0,Zip Code,Total Population,Total Males,Total Females,Total Households
0,90001,57110,28468,28642,12971
1,90002,51223,24876,26347,11731
2,90003,66266,32631,33635,15642
3,90004,62180,31302,30878,22547
4,90005,37681,19299,18382,15044
5,90006,59185,30254,28931,18617
6,90007,40920,20915,20005,11944
7,90008,32327,14477,17850,13841
8,90010,3800,1874,1926,2014
9,90011,103892,52794,51098,22168


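With the ```iloc``` indexer, you can select specific rows from the dataset by integer position. Note that ```iloc``` is used with square brackets, not called like a function.

For example, get rows ```90``` through ```109``` of df. The stop position, ```110```, is excluded, and the output below is truncated to the first ten rows.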

In [6]:
df.iloc[90:110,]

Unnamed: 0,Zip Code,Total Population,Total Males,Total Females,Total Households
90,90265,18116,9159,8957,7174
91,90266,35135,17605,17530,14038
92,90270,27372,13992,13380,6554
93,90272,22986,10952,12034,9212
94,90274,25209,12199,13010,9479
95,90275,41804,20283,21521,15618
96,90277,35293,17521,17772,16910
97,90278,40071,19848,20223,16009
98,90280,94396,46321,48075,23278
99,90290,6368,3180,3188,2612
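
The half-open behaviour of ```iloc``` slices is easier to see on a small frame. The following is a minimal sketch; `sample` and its values are hypothetical stand-ins, not part of the Populations data.

```python
import pandas as pd

# Hypothetical stand-in frame: twelve rows with values 100..111
sample = pd.DataFrame({"value": range(100, 112)})

# iloc slices are half-open: the start position is included, the stop is excluded
rows = sample.iloc[3:6]
print(rows.index.tolist())  # [3, 4, 5]

# A row position and a column position together select a single cell
cell = sample.iloc[0, 0]

# Negative positions count from the end, as with Python lists
last_two = sample.iloc[-2:]
print(last_two["value"].tolist())  # [110, 111]
```

The same rule explains why ```df.iloc[0:10, :]``` matches ```df.head(10)```: both return positions 0 through 9.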


In [7]:
# Checking whether the DataFrame has any NaN values

pd.isnull(df)

Unnamed: 0,Zip Code,Total Population,Total Males,Total Females,Total Households
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False
7,False,False,False,False,False
8,False,False,False,False,False
9,False,False,False,False,False
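
Scanning a full table of True/False values is impractical for 318 rows, so it often helps to summarize the mask. The sketch below uses a small hypothetical frame (`sample`) with two missing values inserted on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two deliberately missing values
sample = pd.DataFrame({
    "Total Males": [28468, np.nan, 32631],
    "Total Females": [28642, 26347, np.nan],
    "Total Households": [12971, 11731, 15642],
})

# Count of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Single True/False answer: is anything missing anywhere in the frame?
has_missing = sample.isnull().any().any()
print(has_missing)
```

On the Populations data, the same two calls could be applied to ```df``` directly.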


A ```Series``` holds a single column or row of data, so retrieving a single column from a ```DataFrame``` returns a ```Series```:

In [8]:
type(df["Zip Code"])

pandas.core.series.Series
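
How a column is requested determines the type that comes back. This short sketch, on a two-row sample copied from the first rows shown earlier, contrasts the common cases.

```python
import pandas as pd

# Two rows copied from the head of the Populations output above
sample = pd.DataFrame({
    "Zip Code": [90001, 90002],
    "Total Population": [57110, 51223],
})

single = sample["Zip Code"]    # one label -> Series
framed = sample[["Zip Code"]]  # list of labels -> DataFrame with one column
row = sample.iloc[0]           # a single row is also returned as a Series

print(type(single).__name__, type(framed).__name__, type(row).__name__)
```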

## Computing aggregations

An important part of analyzing large datasets is computing aggregations such as ```sum()```, ```mean()```, ```median()```, ```min()```, and ```max()```.

In [9]:
df.mean()

Zip Code            90999.509434
Total Population    33345.871069
Total Males         16443.110063
Total Females       16902.761006
Total Households    10999.047170
dtype: float64

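The ```mean()``` method computes the mean of each numerical column in a ```DataFrame```. In this example, the mean of 'Zip Code' is not meaningful, so that column should be kept out of the calculation. Grouping by 'Zip Code' moves it into the index; note that the means are then computed per ZIP code rather than over the whole dataset.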

In [10]:
df.groupby('Zip Code').mean()

Zip Code,Total Population,Total Males,Total Females,Total Households
90001,57110,28468,28642,12971
90002,51223,24876,26347,11731
90003,66266,32631,33635,15642
90004,62180,31302,30878,22547
90005,37681,19299,18382,15044
90006,59185,30254,28931,18617
90007,40920,20915,20005,11944
90008,32327,14477,17850,13841
90010,3800,1874,1926,2014
90011,103892,52794,51098,22168
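
If the goal is an overall mean that simply leaves 'Zip Code' out, dropping the column before aggregating is one alternative to grouping. This is a sketch on a three-row sample copied from the output above, not the notebook's own approach.

```python
import pandas as pd

# Three rows copied from the head of the Populations output
sample = pd.DataFrame({
    "Zip Code": [90001, 90002, 90003],
    "Total Population": [57110, 51223, 66266],
    "Total Households": [12971, 11731, 15642],
})

# drop() returns a new frame; the original sample is left unchanged
column_means = sample.drop(columns=["Zip Code"]).mean()
print(column_means)
```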


In [11]:
Population_mean = df['Total Population'].mean()
Males_mean = df['Total Males'].mean()

print("The mean of Total Population is: ", Population_mean)
print("The mean of Total Males is: ", Males_mean)

The mean of Total Population is:  33345.87106918239
The mean of Total Males is:  16443.110062893084


With the ```axis``` keyword, we can compute the mean of each column or of each row. By default, ```axis = 0```, which computes the mean of each column. Setting ```axis = 1``` computes the mean of each row. Note that the row means below also include 'Zip Code', so they only illustrate the mechanics.

In [12]:
df.mean(axis = 1)

0      43438.4
1      40835.8
2      47635.4
3      47382.2
4      36082.2
5      45398.6
6      36758.2
7      33700.6
8      19924.8
9      63992.6
10     32509.0
11     23994.6
12     21626.6
13     27081.4
14     40270.6
15     29378.2
16     40826.2
17     48455.8
18     36893.6
19     19896.8
20     48280.6
21     38511.2
22     40566.2
23     39109.4
24     50144.0
25     40451.6
26     32484.0
27     36229.2
28     35963.8
29     38873.8
        ...   
288    32530.8
289    43317.2
290    32468.2
291    33176.6
292    26253.4
293    21402.0
294    20836.6
295    35389.8
296    20240.6
297    42443.2
298    19524.8
299    44089.0
300    21091.8
301    19452.8
302    20450.2
303    22445.0
304    20145.4
305    20095.0
306    37250.8
307    51659.8
308    51267.2
309    24633.8
310    19326.2
311    52854.4
312    42222.0
313    35911.6
314    19729.0
315    27569.8
316    18888.4
317    22028.6
Length: 318, dtype: float64
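
The effect of ```axis``` is easier to verify by hand on a tiny frame. The values below are hypothetical and chosen so the means are obvious.

```python
import pandas as pd

# Hypothetical 3x2 frame
sample = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): one mean per column
col_means = sample.mean(axis=0)
print(col_means.tolist())  # [2.0, 20.0]

# axis=1: one mean per row
row_means = sample.mean(axis=1)
print(row_means.tolist())  # [5.5, 11.0, 16.5]
```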

## Listing of aggregation methods

```corr()``` Finds the pairwise correlation between the columns of a ```DataFrame```, or between two ```Series```.

```count()``` Counts the number of non-null values in each column.

```max()```/```min()``` Finds the highest/lowest value in each column. 

```median()``` Finds the median of each column.

```std()```/```var()``` Finds the standard deviation/variance of each column.

In [13]:
# Computing the correlation between columns

df.corr()

Unnamed: 0,Zip Code,Total Population,Total Males,Total Females,Total Households
Zip Code,1.0,-0.086736,-0.086181,-0.086977,-0.159326
Total Population,-0.086736,1.0,0.998201,0.998262,0.908346
Total Males,-0.086181,0.998201,1.0,0.992931,0.904769
Total Females,-0.086977,0.998262,0.992931,1.0,0.908676
Total Households,-0.159326,0.908346,0.904769,0.908676,1.0


In [14]:
df.count()

Zip Code            318
Total Population    318
Total Males         318
Total Females       318
Total Households    318
dtype: int64

In [15]:
df['Total Population'].max()

105549

In [16]:
df['Total Males'].mean()

16443.110062893084

In [17]:
df['Total Population'].std()

21597.734846302996
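
The cells above cover ```corr()```, ```count()```, ```max()```, ```mean()```, and ```std()```; the remaining methods from the list are called the same way. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical frame where y is exactly twice x
sample = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})

x_median = sample["x"].median()    # middle value: 2.5
x_min = sample["x"].min()          # smallest value: 1
x_var = sample["x"].var()          # sample variance (divides by n - 1)
xy_corr = sample["x"].corr(sample["y"])  # correlation between two Series

print(x_median, x_min, x_var, xy_corr)
```

Both ```std()``` and ```var()``` use the sample definition (```ddof=1```) by default; passing ```ddof=0``` gives the population version.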

## Math Operations

In [18]:
# Dividing every value in the 'Total Population' column by 2

df["Total Population"]/2

0      28555.0
1      25611.5
2      33133.0
3      31090.0
4      18840.5
5      29592.5
6      20460.0
7      16163.5
8       1900.0
9      51946.0
10     15551.5
11      5886.0
12      3502.5
13      9493.0
14     23798.0
15     11884.0
16     24655.0
17     32229.0
18     19483.5
19      1975.5
20     33589.5
21     22951.5
22     23726.0
23     21073.5
24     33934.5
25     22575.5
26     14357.0
27     19308.5
28     19658.0
29     22893.0
        ...   
288    15427.0
289    26367.5
290    15161.0
291    16362.5
292     8381.5
293     3110.0
294     2447.0
295    17766.5
296     1806.5
297    25883.5
298     1015.5
299    27183.0
300     2538.5
301      849.5
302     2088.0
303     3996.5
304     1537.0
305     1466.0
306    19670.5
307    36023.0
308    35459.0
309     6516.5
310      629.5
311    37464.5
312    25399.0
313    19079.0
314     1069.0
315     9455.0
316      194.0
317     3642.5
Name: Total Population, Length: 318, dtype: float64

In [19]:
# Difference between the 'Total Females' and 'Total Males' columns

df['Total Females']-df['Total Males']

0       174
1      1471
2      1004
3      -424
4      -917
5     -1323
6      -910
7      3373
8        52
9     -1696
10    -7883
11    -3486
12    -1937
13     -680
14     2040
15    -1868
16     1770
17     1574
18      205
19    -1629
20      747
21     -171
22     2956
23      429
24    -1161
25      427
26    -3398
27     -533
28      224
29      658
       ... 
288     954
289    3069
290    1350
291    -989
292    -367
293     -52
294    -150
295     857
296      99
297     481
298     -73
299    1182
300     -93
301     -69
302   -2426
303    -179
304      12
305    -352
306    2139
307    2288
308   -4690
309    -357
310    -119
311    2101
312     686
313     736
314    -104
315     -72
316    -138
317     -21
Length: 318, dtype: int64
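
The results of these element-wise operations are not stored unless they are assigned. The sketch below keeps them as new columns on a three-row sample copied from the Populations output; the new column names are hypothetical.

```python
import pandas as pd

# Three rows copied from the head of the Populations output
sample = pd.DataFrame({
    "Total Population": [57110, 51223, 66266],
    "Total Males": [28468, 24876, 32631],
    "Total Females": [28642, 26347, 33635],
    "Total Households": [12971, 11731, 15642],
})

# Column-with-column arithmetic is aligned row by row
sample["Female Surplus"] = sample["Total Females"] - sample["Total Males"]

# Division always produces floats
sample["People per Household"] = sample["Total Population"] / sample["Total Households"]

print(sample[["Female Surplus", "People per Household"]])
```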

## Boolean Indexing

Boolean indexing starts by comparing each value in a ```DataFrame``` or ```Series``` to a specified value. The result holds Boolean values indicating the outcome of each comparison.

In our example, the mean of 'Total Population' is about 33346. Let's check which ZIP codes have a population above the average:

In [20]:
Population_above_average = df['Total Population'] > 33346
Population_above_average

0       True
1       True
2       True
3       True
4       True
5       True
6       True
7      False
8      False
9       True
10     False
11     False
12     False
13     False
14      True
15     False
16      True
17      True
18      True
19     False
20      True
21      True
22      True
23      True
24      True
25      True
26     False
27      True
28      True
29      True
       ...  
288    False
289     True
290    False
291    False
292    False
293    False
294    False
295     True
296    False
297     True
298    False
299     True
300    False
301    False
302    False
303    False
304    False
305    False
306     True
307     True
308     True
309    False
310    False
311     True
312     True
313     True
314    False
315    False
316    False
317    False
Name: Total Population, Length: 318, dtype: bool
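
A Boolean Series like the one above is usually passed back into the ```DataFrame``` to keep only the matching rows. The following sketch uses four rows copied from the Populations output and computes the threshold instead of hard-coding it.

```python
import pandas as pd

# Four rows copied from the Populations output
sample = pd.DataFrame({
    "Zip Code": [90001, 90008, 90010, 90011],
    "Total Population": [57110, 32327, 3800, 103892],
})

# Build the mask from the sample's own mean (49282.25 here)
threshold = sample["Total Population"].mean()
mask = sample["Total Population"] > threshold

# Indexing with the mask keeps only rows where it is True
above = sample[mask]
print(above["Zip Code"].tolist())  # [90001, 90011]

# Conditions are combined with & or |, each wrapped in parentheses
mid = sample[(sample["Total Population"] > 10000) & (sample["Total Population"] < 60000)]
print(mid["Zip Code"].tolist())  # [90001, 90008]
```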

## Aggregation

In [21]:
# Aggregate these functions across all columns:

df.aggregate(['median','max'])

Unnamed: 0,Zip Code,Total Population,Total Males,Total Females,Total Households
median,90806.5,31490.0,15297.5,16209.0,10979.5
max,93591.0,105549.0,52794.0,53185.0,31087.0


In [22]:
# Different aggregations per column:

df.aggregate({'Total Females': ['sum','median'], 'Total Households': ['median','max']})

Unnamed: 0,Total Females,Total Households
max,,31087.0
median,16209.0,10979.5
sum,5375078.0,
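
The empty cells in the result above are NaN: every function requested for any column becomes a row, and a cell stays empty where that function was not requested for that column. A small sketch on hypothetical data shows the same pattern (```agg``` is an alias for ```aggregate```).

```python
import pandas as pd

# Hypothetical frame
sample = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# 'sum' is requested only for a, 'max' only for b, 'median' for both
result = sample.agg({"a": ["sum", "median"], "b": ["median", "max"]})
print(result)

# Look cells up by label so the row order does not matter
print(result.loc["sum", "a"], result.loc["max", "b"])
```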


## Filtering

Filtering is useful when you want to keep only part of the data. The ```filter()``` method selects columns (or rows) by label, not by value.

In [29]:
# Select the columns you want by name

df.filter(items = ['Zip Code','Total Males','Total Females'])

Unnamed: 0,Zip Code,Total Males,Total Females
0,90001,28468,28642
1,90002,24876,26347
2,90003,32631,33635
3,90004,31302,30878
4,90005,19299,18382
5,90006,30254,28931
6,90007,20915,20005
7,90008,14477,17850
8,90010,1874,1926
9,90011,52794,51098
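
Besides an explicit ```items``` list, ```filter()``` also accepts ```like``` (substring match) and ```regex``` (regular expression match) on the labels. A minimal sketch on two rows copied from the output above:

```python
import pandas as pd

# Two rows copied from the head of the Populations output
sample = pd.DataFrame({
    "Zip Code": [90001, 90002],
    "Total Males": [28468, 24876],
    "Total Females": [28642, 26347],
    "Total Households": [12971, 11731],
})

# Columns whose name contains the substring "Total"
totals = sample.filter(like="Total")
print(list(totals.columns))

# Columns whose name ends in "ales" (matches both Males and Females)
sexes = sample.filter(regex="ales$")
print(list(sexes.columns))
```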


## Sorting and Ranking

We can sort a dataset by its row or column index. By default, ```axis = 0```, which sorts by the row index. Setting ```axis = 1``` (or ```axis = 'columns'```) sorts by the column index.

In [65]:
# Sorting the column index

df.sort_index(axis=1)

Unnamed: 0,Total Females,Total Households,Total Males,Total Population,Zip Code
0,28642,12971,28468,57110,90001
1,26347,11731,24876,51223,90002
2,33635,15642,32631,66266,90003
3,30878,22547,31302,62180,90004
4,18382,15044,19299,37681,90005
5,28931,18617,30254,59185,90006
6,20005,11944,20915,40920,90007
7,17850,13841,14477,32327,90008
8,1926,2014,1874,3800,90010
9,51098,22168,52794,103892,90011


In [72]:
# Example:
import numpy as np

obj = pd.DataFrame(np.arange(6).reshape((2, 3)), index=['first', 'second'], columns=['b', 'c', 'a'])
obj

Unnamed: 0,b,c,a
first,0,1,2
second,3,4,5


In [73]:
obj.sort_index()

Unnamed: 0,b,c,a
first,0,1,2
second,3,4,5


In [76]:
obj.sort_index(axis = 1)

Unnamed: 0,a,b,c
first,2,0,1
second,5,3,4


In [87]:
# Create a DataFrame to investigate the rank of the value

# Note: this reassigns df, replacing the Populations DataFrame used above
raw_data = [['Jason', 'Miller', 42, 4, 25],
            ['Molly', 'Jacobson', 52, 24, 94],
            ['Tina', 'Alison', 36, 31, 57],
            ['Jake', 'Milner', 24, 2, 62],
            ['Amy', 'Cooze', 73, 3, 70]]

df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
print(df)


  first_name last_name  age  preTestScore  postTestScore
0      Jason    Miller   42             4             25
1      Molly  Jacobson   52            24             94
2       Tina    Alison   36            31             57
3       Jake    Milner   24             2             62
4        Amy     Cooze   73             3             70


In [91]:
# Ranking the rows by 'age' in ascending order (rank 1 = youngest)

df['age_ranked'] = df['age'].rank(ascending=True)
df

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore,age_ranked
0,Jason,Miller,42,4,25,3.0
1,Molly,Jacobson,52,24,94,4.0
2,Tina,Alison,36,31,57,2.0
3,Jake,Milner,24,2,62,1.0
4,Amy,Cooze,73,3,70,5.0


You can rank in descending order like this: 

In [92]:
df['age_ranked'] = df['age'].rank(ascending=False)
df

Unnamed: 0,first_name,last_name,age,preTestScore,postTestScore,age_ranked
0,Jason,Miller,42,4,25,3.0
1,Molly,Jacobson,52,24,94,2.0
2,Tina,Alison,36,31,57,4.0
3,Jake,Milner,24,2,62,5.0
4,Amy,Cooze,73,3,70,1.0
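
The ranks are floats because tied values share the average of their positions by default. The ```method``` parameter changes how ties are handled. The ages below are hypothetical, with one tie added to show the difference.

```python
import pandas as pd

# Hypothetical ages with a tie at 36
ages = pd.Series([42, 52, 36, 36, 73])

# Default method='average': the two 36s share rank (1 + 2) / 2 = 1.5
average = ages.rank()
print(average.tolist())  # [3.0, 4.0, 1.5, 1.5, 5.0]

# method='min': tied values all get the lowest rank in the group
minimum = ages.rank(method="min")
print(minimum.tolist())  # [3.0, 4.0, 1.0, 1.0, 5.0]

# method='dense': like 'min', but ranks increase by 1 between groups
dense = ages.rank(method="dense")
print(dense.tolist())  # [2.0, 3.0, 1.0, 1.0, 4.0]
```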
