### Cleaning column names
* Remove whitespace from the start and end of each label.
* Replace spaces with underscores and remove special characters (so the column can be accessed with dot notation).
* Make all labels lowercase.
* Shorten long label names.
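
A minimal sketch of the steps above. The DataFrame and the helper name `clean_label` are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Hypothetical frame with messy labels
raw = pd.DataFrame({' First Name ': ['Ann'], 'Total Amount ($)': [10]})

def clean_label(label):
    """Strip, lowercase, underscore the spaces, then drop special characters."""
    label = label.strip().lower().replace(' ', '_')
    label = ''.join(ch for ch in label if ch.isalnum() or ch == '_')
    return label.strip('_')  # remove underscores left dangling at the ends

raw.columns = [clean_label(c) for c in raw.columns]
print(list(raw.columns))
print(raw.first_name)  # dot notation works once the label is a valid identifier
```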

### Cleaning column data
* Explore the data in the column (e.g. using the `unique()` method).
* Identify patterns and special cases.
* Remove or replace characters as needed (e.g. remove non-digit characters).
* Convert columns to a specific type if needed.
* Rename the column if needed.
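
The workflow above might look like this on a small, invented price column (the column names `price` and `price_usd` are assumptions for the example):

```python
import pandas as pd

prices = pd.DataFrame({'price': ['$1,200', '950 USD', '$30']})

# 1. Explore the raw values to spot patterns and special cases
print(prices['price'].unique())

# 2. Remove every non-digit character, then convert to an integer type
prices['price'] = prices['price'].str.replace(r'\D', '', regex=True).astype(int)

# 3. Rename the column so the unit is explicit
prices = prices.rename(columns={'price': 'price_usd'})
print(prices)
```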

### Counting null values by column
* `DataFrame.info()` shows the number of non-null values in each column.
* `DataFrame.isnull().sum()` counts the null values in each column.
* Options for handling missing values:
    - Remove rows that contain null values.
    - Remove columns that contain missing values.
    - Fill the missing values with another value.
    - Leave them as they are.
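
A small sketch of counting nulls and of the handling options listed above. The `people` frame and the fill values are invented for the example.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({'age': [25, np.nan, 40],
                       'city': ['Oslo', None, None]})

null_counts = people.isnull().sum()          # null count per column
rows_dropped = people.dropna()               # drop rows with any null
cols_dropped = people.dropna(axis='columns') # drop columns with any null
filled = people.fillna({'age': people['age'].mean(), 'city': 'unknown'})

print(null_counts)
print(filled)
```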

### Missing Data
* All descriptive statistics on pandas objects exclude missing data by default.
* pandas uses the sentinel values `NaN` or `None` to represent missing data. For numeric data it uses `NaN`.
* `None` is a Python object, so in an array it can only be stored with `object` dtype, not with a native NumPy numeric type. Operations on `object` arrays run at the Python level and are slower, and aggregations such as `sum()` over a NumPy array containing `None` raise an error.
* `NaN` is a special floating-point value defined by the IEEE floating-point standard.
* Any arithmetic operation involving `NaN` results in `NaN`.
![missing_data1](images/missing_data1.jpg)
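
The points above can be checked with a few made-up values. This is only a sketch; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])
total = values.sum()        # pandas statistics skip NaN by default
plain = 1.0 + np.nan        # plain arithmetic with NaN gives NaN

from_none = pd.Series([1, None, 3])   # pandas converts None to NaN here
obj_array = np.array([1, None, 3])    # NumPy keeps None and falls back to object dtype

try:
    obj_array.sum()
    sum_failed = False
except TypeError:
    sum_failed = True       # int + None is not defined at the Python level

print(total, plain, from_none.dtype, obj_array.dtype, sum_failed)
```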

* In the pandas version used here, string data is stored with `object` dtype.

In [14]:
import pandas as pd
import numpy as np

from pandas import DataFrame, Series

In [2]:
s1 = Series(['purvil', 'japan', np.nan, 'shailesh'])

In [3]:
s1

0      purvil
1       japan
2         NaN
3    shailesh
dtype: object

In [4]:
s1.isnull()

0    False
1    False
2     True
3    False
dtype: bool

* NA data means the value either does not exist or exists but was not observed. Always keep an eye on missing data: it can point to a data collection problem, and it may introduce bias.

In [5]:
s1[0] = None

In [6]:
s1.isnull()

0     True
1    False
2     True
3    False
dtype: bool

![missing data](images/missing_data.jpg)

### Filtering out missing data

In [7]:
from numpy import nan as NA

In [8]:
s2 = Series([1, NA, 3.5, NA, 7])

In [9]:
s2.dropna() 

0    1.0
2    3.5
4    7.0
dtype: float64

* Another way is boolean indexing with `pandas.isnull` or `pandas.notnull`.

In [10]:
s2[s2.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

* For a DataFrame, `dropna` by default drops every row that contains at least one NA value.

In [11]:
f1 = DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA,NA,NA], [NA, 6.5, 3.]])

In [12]:
f1

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [13]:
f1.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [14]:
f1.dropna(how= "all") # Only drops rows that are all NA

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


* `how = "any"` (the default) drops a row or column (depending on `axis`) if any value is NA.

In [15]:
f1.dropna(axis = 'columns', how = 'all') # To drop column use axis = 1 or axis = 'columns'

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [100]:
f2 = DataFrame(np.random.randn(7,3))

In [101]:
f2

Unnamed: 0,0,1,2
0,-0.454006,-0.939578,1.094883
1,-0.112351,1.081407,-0.740999
2,-0.149556,0.387025,0.544651
3,1.334937,-0.026839,1.176562
4,-0.957007,-1.024423,-1.857209
5,0.086567,-0.248291,0.523235
6,-0.005493,-0.918847,0.086928


In [102]:
f2.iloc[:4, 1] = NA

In [103]:
f2.iloc[:2, 2] = NA

In [104]:
f2

Unnamed: 0,0,1,2
0,-0.454006,,
1,-0.112351,,
2,-0.149556,,0.544651
3,1.334937,,1.176562
4,-0.957007,-1.024423,-1.857209
5,0.086567,-0.248291,0.523235
6,-0.005493,-0.918847,0.086928


####  `thresh`
* Keep a row (or column, depending on `axis`) only if it contains at least `thresh` non-NA values.

In [105]:
f2.dropna()

Unnamed: 0,0,1,2
4,-0.957007,-1.024423,-1.857209
5,0.086567,-0.248291,0.523235
6,-0.005493,-0.918847,0.086928


In [106]:
f2.dropna(thresh = 2)

Unnamed: 0,0,1,2
2,-0.149556,,0.544651
3,1.334937,,1.176562
4,-0.957007,-1.024423,-1.857209
5,0.086567,-0.248291,0.523235
6,-0.005493,-0.918847,0.086928


### Filling in missing data
* `fillna` fills missing values with the supplied constant value.

In [107]:
f2

Unnamed: 0,0,1,2
0,-0.454006,,
1,-0.112351,,
2,-0.149556,,0.544651
3,1.334937,,1.176562
4,-0.957007,-1.024423,-1.857209
5,0.086567,-0.248291,0.523235
6,-0.005493,-0.918847,0.086928


In [108]:
f2.fillna(0)

Unnamed: 0,0,1,2
0,-0.454006,0.0,0.0
1,-0.112351,0.0,0.0
2,-0.149556,0.0,0.544651
3,1.334937,0.0,1.176562
4,-0.957007,-1.024423,-1.857209
5,0.086567,-0.248291,0.523235
6,-0.005493,-0.918847,0.086928


In [109]:
f2.fillna({1:0.5, 2:0}) # a different fill value for each column

Unnamed: 0,0,1,2
0,-0.454006,0.5,0.0
1,-0.112351,0.5,0.0
2,-0.149556,0.5,0.544651
3,1.334937,0.5,1.176562
4,-0.957007,-1.024423,-1.857209
5,0.086567,-0.248291,0.523235
6,-0.005493,-0.918847,0.086928


* To modify the object in place, pass `inplace = True`:

In [110]:
f2.fillna(0, inplace = True)

In [111]:
f2

Unnamed: 0,0,1,2
0,-0.454006,0.0,0.0
1,-0.112351,0.0,0.0
2,-0.149556,0.0,0.544651
3,1.334937,0.0,1.176562
4,-0.957007,-1.024423,-1.857209
5,0.086567,-0.248291,0.523235
6,-0.005493,-0.918847,0.086928


In [112]:
f3 = DataFrame(np.random.randn(6,3))

In [113]:
f3

Unnamed: 0,0,1,2
0,0.539949,0.110458,0.928284
1,-1.205511,-0.299693,1.034078
2,-0.164941,0.93868,0.436592
3,-0.618911,-1.165438,-0.322496
4,0.155157,0.992585,0.476875
5,-0.226414,-0.534909,-0.905426


In [114]:
f3.iloc[2:, 1] = NA
f3.iloc[4:, 2] = NA

In [115]:
f3

Unnamed: 0,0,1,2
0,0.539949,0.110458,0.928284
1,-1.205511,-0.299693,1.034078
2,-0.164941,,0.436592
3,-0.618911,,-0.322496
4,0.155157,,
5,-0.226414,,


In [117]:
f3.ffill() # forward fill: propagate the previous value forward; `bfill()` fills backward. Replaces `fillna(method = 'ffill')`, which is deprecated in recent pandas

Unnamed: 0,0,1,2
0,0.539949,0.110458,0.928284
1,-1.205511,-0.299693,1.034078
2,-0.164941,-0.299693,0.436592
3,-0.618911,-0.299693,-0.322496
4,0.155157,-0.299693,-0.322496
5,-0.226414,-0.299693,-0.322496


* If no previous value (for forward fill) or next value (for backward fill) is available, the result stays NA.
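
A tiny sketch of that edge case with an invented Series: a leading NA has nothing before it to copy, and a trailing NA has nothing after it.

```python
import numpy as np
import pandas as pd

gaps = pd.Series([np.nan, 1.0, np.nan, np.nan])

forward = gaps.ffill()   # position 0 has no earlier value, so it stays NaN
backward = gaps.bfill()  # positions 2 and 3 have no later value, so they stay NaN

print(forward)
print(backward)
```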

In [118]:
f3.ffill(limit = 2) # fill at most 2 consecutive NA values in each gap

Unnamed: 0,0,1,2
0,0.539949,0.110458,0.928284
1,-1.205511,-0.299693,1.034078
2,-0.164941,-0.299693,0.436592
3,-0.618911,-0.299693,-0.322496
4,0.155157,,-0.322496
5,-0.226414,,-0.322496


In [119]:
s2

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [120]:
s2.fillna(s2.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

![fillna](images/fillna.jpg)

-------

# Data Transformation

### Removing Duplicates
* A DataFrame can contain duplicate rows.

In [36]:
f4 = DataFrame({'k1':['one', 'two'] * 3 + ['two'], 'k2': [1,1,2,3,3,4,4]})

In [37]:
f4

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [38]:
f4.duplicated() # returns a boolean Series indicating whether each row is a duplicate (has been observed previously)

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [39]:
f4.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


* We can specify the columns on which duplicates are detected.

In [40]:
f4['v1'] = range(7)

In [41]:
f4

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [42]:
f4.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


* To keep the last observed value, use `keep = 'last'`.

In [43]:
f4.drop_duplicates(['k1'], keep = 'last')

Unnamed: 0,k1,k2,v1
4,one,3,4
6,two,4,6


### Transforming Data using a function or mapping

In [44]:
f5 = DataFrame({'Food': ['bacon', 'pulled pork', 'bacon', 'pastrami', 
                         'corned beef', 'Bacon', 'pastrami', 'honey ham', 'nova lox'],
                'Onces': [4,3,12,6,7.5,8,3,5,6]})

f5

Unnamed: 0,Food,Onces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


* Suppose we want to add a column indicating the animal each meat came from.

In [45]:
meat_to_animal = {'bacon': 'pig', 'pulled pork': 'pig', 'pastrami':'cow',
                  'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon'}

* The `map` method on a Series accepts a function or a dict-like object containing a mapping.

In [46]:
lowercased = f5['Food'].str.lower()

In [47]:
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: Food, dtype: object

In [48]:
f5['animal'] = lowercased.map(meat_to_animal)

In [49]:
f5

Unnamed: 0,Food,Onces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


* `map` is a convenient way to perform element-wise transformations. Here a function does the same work:

In [50]:
f5['Food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: Food, dtype: object
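
One difference between the two approaches is worth a sketch: a dict passed to `map` yields NaN for values it does not contain, while a lambda that indexes the dict raises `KeyError`. The data below (including `'Tofu'` and the `'unknown'` fallback) is invented for the example.

```python
import pandas as pd

foods = pd.Series(['bacon', 'Tofu'])
to_animal = {'bacon': 'pig'}

# Dict mapping: values missing from the dict become NaN
mapped = foods.str.lower().map(to_animal)

# Function mapping: plain indexing fails on a missing key
try:
    foods.map(lambda x: to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True

# dict.get with a default avoids the error
safe = foods.map(lambda x: to_animal.get(x.lower(), 'unknown'))
print(mapped, raised, safe, sep='\n')
```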

### Replacing values

In [51]:
s4 = Series([1.,-999.,2.,-999.,-1000.,3.])

In [52]:
s4

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [53]:
s4.replace(-999, NA)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [54]:
s4.replace([-999., -1000.], NA)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [55]:
s4.replace([-999.,-1000.], [NA, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [56]:
s4.replace({-999.:np.nan, -1000:0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

### Renaming axis index

In [57]:
f6 = DataFrame(np.arange(12).reshape((3,4)), index=['Ohio', 'Colorado', 'New York'], 
               columns=['one', 'two', 'three', 'four'])

In [58]:
f6

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [59]:
f6.index.map(lambda x: x.upper())

Index(['OHIO', 'COLORADO', 'NEW YORK'], dtype='object')

In [60]:
f6.index = f6.index.map(lambda x: x.upper())

In [61]:
f6

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


* To create a transformed version of a dataset without modifying the original, use `rename`.

In [62]:
f6.rename(index = str.title, columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [63]:
f6

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


In [64]:
f6.rename(index = {'OHIO':'CA'}, columns={'three':3})

Unnamed: 0,one,two,3,four
CA,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


* Passing `inplace = True` to `rename` modifies the dataset in place.

### Discretization and Binning
* Continuous data is often discretised or binned for analysis, e.g. grouping ages into discrete age buckets.

In [65]:
age = [20,22,25,27,21,23,37,31,61,45,41,32]

* Divide the ages into bins of 18 to 25, 26 to 35, 36 to 60, and 61 and older using the `cut` function.

In [66]:
bins = [18,25,35,60,100]

In [67]:
cats = pd.cut(age, bins)

In [68]:
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [69]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [70]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

In [71]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [72]:
pd.cut(age, bins, right = False) # close the bins on the left instead of the right: [18, 25)

[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

In [73]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [74]:
pd.cut(age, bins, labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

In [15]:
data = np.random.rand(20)

In [16]:
pd.cut(data, 4, precision=2) # precision=2 limits the decimal precision of the bin labels

[(0.45, 0.65], (0.65, 0.85], (0.45, 0.65], (0.65, 0.85], (0.45, 0.65], ..., (0.65, 0.85], (0.45, 0.65], (0.26, 0.45], (0.45, 0.65], (0.057, 0.26]]
Length: 20
Categories (4, interval[float64]): [(0.057, 0.26] < (0.26, 0.45] < (0.45, 0.65] < (0.65, 0.85]]

In [17]:
pd.cut(data, 4, precision=2, labels=['Low', 'MED', 'Good', 'High']) # same bins, with custom labels

[Good, High, Good, High, Good, ..., High, Good, MED, Good, Low]
Length: 20
Categories (4, object): [Low < MED < Good < High]

* When an integer number of bins is passed to `cut` instead of explicit bin edges, it computes equal-length bins based on the minimum and maximum values in the data.
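
To see the computed edges, `cut` can return them with `retbins=True`. A small sketch on invented values; as far as I know, pandas widens the lowest edge slightly so that the minimum value falls inside the first bin.

```python
import pandas as pd

values = [0, 2, 5, 9, 10]

# labels=False returns integer bin codes; retbins=True also returns the edges
codes, edges = pd.cut(values, 2, labels=False, retbins=True)

print(codes)  # bin index of each value
print(edges)  # edges derived from min(values) and max(values)
```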

#### `qcut`
* Cuts the data based on sample quantiles. With `cut`, depending on the distribution of the data, the bins usually do not contain equal numbers of data points. `qcut` uses sample quantiles as bin edges, so the bins are roughly equal in size.

In [77]:
data = np.random.randn(1000) # normally distributed

In [78]:
cats = pd.qcut(data, 4) # cut into quartile

In [79]:
cats

[(-0.638, 0.0589], (0.0589, 0.71], (0.0589, 0.71], (0.71, 2.826], (0.0589, 0.71], ..., (-4.0520000000000005, -0.638], (0.0589, 0.71], (-4.0520000000000005, -0.638], (-4.0520000000000005, -0.638], (0.0589, 0.71]]
Length: 1000
Categories (4, interval[float64]): [(-4.0520000000000005, -0.638] < (-0.638, 0.0589] < (0.0589, 0.71] < (0.71, 2.826]]

In [80]:
pd.value_counts(cats) # pass dropna=False to include null values in the counts

(0.71, 2.826]                    250
(0.0589, 0.71]                   250
(-0.638, 0.0589]                 250
(-4.0520000000000005, -0.638]    250
dtype: int64
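
The contrast between `cut` and `qcut` shows up clearly on skewed data. The sketch below uses a seeded exponential sample, which is an assumption made for the example and not data from these notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
skewed = pd.Series(rng.exponential(size=1000))

# Equal-width bins: most points pile up in the lowest bin
equal_width = pd.cut(skewed, 4).value_counts().sort_index()

# Quantile bins: each bin receives the same number of points
equal_count = pd.qcut(skewed, 4).value_counts().sort_index()

print(equal_width)
print(equal_count)
```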

### Detecting and Filtering outliers

In [81]:
f7 = DataFrame(np.random.randn(1000, 4))

In [82]:
f7.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.015581,0.042649,0.026861,0.033918
std,1.015832,0.99502,0.989489,0.99878
min,-3.743163,-2.876459,-2.868134,-3.068202
25%,-0.659204,-0.63842,-0.659718,-0.647716
50%,-0.015443,0.080847,0.047371,0.057847
75%,0.711525,0.697676,0.685383,0.709581
max,3.521572,3.857762,3.392503,3.320866


* Find values in a column exceeding 3 in absolute value.

In [83]:
col = f7[2]

In [84]:
col[np.abs(col) > 3]

671    3.233454
954    3.392503
Name: 2, dtype: float64

In [85]:
f7[np.abs(f7[2]) > 3]

Unnamed: 0,0,1,2,3
671,1.570912,-0.968068,3.233454,0.523925
954,-1.014976,1.296221,3.392503,1.012706


* To select all rows having a value above 3 or below -3, use `any`.

In [86]:
f7[(np.abs(f7) > 3).any(axis=1)] # axis must be passed by keyword in current pandas

Unnamed: 0,0,1,2,3
25,-0.764958,-0.717719,0.073299,3.320866
478,3.353005,0.327776,1.426488,-1.824012
491,3.521572,0.938357,0.63484,0.422669
514,0.576551,-0.257346,0.776172,3.212352
589,3.086277,-1.120329,1.037408,0.009
628,-0.580425,3.857762,1.193073,1.120088
671,1.570912,-0.968068,3.233454,0.523925
674,3.372624,-1.067934,-1.457845,0.146544
691,-3.743163,-1.556565,-0.096032,0.925127
739,-0.451385,0.615533,-0.432775,3.088132


In [87]:
f7[np.abs(f7) > 3] = np.sign(f7) * 3

In [88]:
f7.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.015247,0.041791,0.026235,0.033365
std,1.008342,0.992094,0.987498,0.996643
min,-3.0,-2.876459,-2.868134,-3.0
25%,-0.659204,-0.63842,-0.659718,-0.647716
50%,-0.015443,0.080847,0.047371,0.057847
75%,0.711525,0.697676,0.685383,0.709581
max,3.0,3.0,3.0,3.0


* `np.sign(f7)` produces 1 or -1 (or 0 for zero) based on the sign of each value, so the assignment above caps values to the interval -3 to 3.
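
For this particular capping step, `DataFrame.clip` is an alternative that gives the same result. A sketch on a small invented frame, comparing both approaches:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [-4.0, 0.5, 3.5], 'b': [1.0, -3.2, 2.0]})

# Approach used above: overwrite outliers with sign * 3
capped = frame.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative: clip values to the interval [-3, 3]
clipped = frame.clip(lower=-3, upper=3)

print(clipped)
```

`clip` returns a new object and leaves `frame` unchanged.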

### Permutation and Random sampling

* `np.random.permutation` is useful for permuting a Series or the rows of a DataFrame.
* Calling it with the length of the axis you want to permute produces an array of integers indicating the new ordering.

In [89]:
f8 = DataFrame(np.arange(5 * 4).reshape((5,4)))

In [90]:
f8

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [91]:
sampler = np.random.permutation(5)

In [92]:
sampler

array([4, 0, 3, 2, 1])

In [93]:
f8.iloc[sampler]

Unnamed: 0,0,1,2,3
4,16,17,18,19
0,0,1,2,3
3,12,13,14,15
2,8,9,10,11
1,4,5,6,7


* To select a random subset without replacement, use the `sample` method on a Series or DataFrame.

In [94]:
f8.sample(n = 3)

Unnamed: 0,0,1,2,3
2,8,9,10,11
4,16,17,18,19
0,0,1,2,3


In [95]:
f8.sample(n = 3, replace=True) # sample with replacement

Unnamed: 0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
0,0,1,2,3
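
Random draws differ from run to run. When reproducible results matter, `sample` accepts a `random_state` seed. A sketch (the seed value 42 is arbitrary):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

first = frame.sample(n=3, random_state=42)
second = frame.sample(n=3, random_state=42)   # same seed, same rows

# frac=1 samples every row once, i.e. shuffles the rows
shuffled = frame.sample(frac=1, random_state=42)

print(first)
print(shuffled)
```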


### Computing indicator/dummy values
* Converts a categorical variable into a dummy or indicator matrix.

In [123]:
f9 = pd.DataFrame({'key':['b','b','a','c', 'a', 'b'], 'data1':range(6)})

In [124]:
f9

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


#### `get_dummies`

In [126]:
pd.get_dummies(f9['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [128]:
dummies = pd.get_dummies(f9['key'], prefix='key')

In [129]:
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [132]:
f9[['data1']].join(dummies)

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0
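
The 0/1 output above comes from the pandas version used in these notes. To my knowledge, recent pandas releases (2.0 and later) return boolean columns by default, so passing `dtype=int` is one way to keep integer indicators. A sketch:

```python
import pandas as pd

keys = pd.Series(['b', 'b', 'a', 'c', 'a', 'b'])

# dtype=int forces 0/1 indicators regardless of the library default
dummies_int = pd.get_dummies(keys, prefix='key', dtype=int)

print(dummies_int)
```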


### String Object Methods

* We can apply string methods or regular expressions to each value using `data.map`, but it will fail when it encounters NA values. To skip NA values, use the Series string methods available through the `str` attribute.

In [162]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com', 'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [163]:
data = pd.Series(data)

In [164]:
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [166]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object

In [167]:
data.str[:5] # slice strings

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

![string_methods](images/string_methods.jpg)
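
Regular expressions work through the `str` accessor as well. The sketch below splits each address with `str.extract`; the pattern is a simplified, illustrative email regex and not a complete validator.

```python
import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                    'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Three capture groups: user name, domain name, domain suffix
pattern = r'([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+)\.([A-Za-z]{2,4})'

parts = emails.str.extract(pattern)  # one column per group; NA rows stay NA
print(parts)
```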

### Tidy data
* Each column represents a separate variable.
* Each row represents an individual observation.
* Each type of observational unit forms a table.
* Problem: columns contain values instead of variables.
    - Fix it with `pd.melt()`

In [14]:
my_df = pd.DataFrame({'name':['Danial', 'John', 'Jane'],
                     'treatment a': [np.nan, 12,24],
                     'treatment b': [42,31,27]})

In [15]:
my_df

Unnamed: 0,name,treatment a,treatment b
0,Danial,,42
1,John,12.0,31
2,Jane,24.0,27


In [16]:
pd.melt(my_df, id_vars='name', value_vars=['treatment a', 'treatment b'])

Unnamed: 0,name,variable,value
0,Danial,treatment a,
1,John,treatment a,12.0
2,Jane,treatment a,24.0
3,Danial,treatment b,42.0
4,John,treatment b,31.0
5,Jane,treatment b,27.0


* `id_vars`: columns to hold constant
* `value_vars`: columns to melt

In [17]:
pd.melt(my_df, id_vars='name', value_vars=['treatment a', 'treatment b'],
       var_name = 'treatment', value_name = 'result')

Unnamed: 0,name,treatment,result
0,Danial,treatment a,
1,John,treatment a,12.0
2,Jane,treatment a,24.0
3,Danial,treatment b,42.0
4,John,treatment b,31.0
5,Jane,treatment b,27.0


* Pivoting is the opposite of melting.
* Pivoting turns the unique values of a column into separate columns.
* It moves data from an analysis-friendly shape to a reporting-friendly shape.
* It is useful for tidying a dataset when multiple variables are stored in the same column.

In [4]:
from datetime import datetime

In [5]:
my_df2 = pd.DataFrame({'date': [datetime(2010,1,30),datetime(2010,1,30),datetime(2010,2,2),datetime(2010,2,2)],
                      'element':['tmax','tmin','tmax','tmin'],
                      'value':[27.8,14.5,27.3,14.4]})

In [6]:
my_df2

Unnamed: 0,date,element,value
0,2010-01-30,tmax,27.8
1,2010-01-30,tmin,14.5
2,2010-02-02,tmax,27.3
3,2010-02-02,tmin,14.4


In [7]:
my_df2.pivot(index = 'date', columns='element', values='value')

element,tmax,tmin
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2010-01-30,27.8,14.5
2010-02-02,27.3,14.4


* `index`: column to keep fixed as the row index
* `columns`: column whose values become the new columns
* `values`: column whose values fill the new columns

In [8]:
my_df3 = pd.DataFrame({'date': [datetime(2010,1,30),datetime(2010,1,30),datetime(2010,2,2),datetime(2010,2,2)],
                      'element':['tmax','tmin','tmin','tmin'],
                      'value':[27.8,14.5,27.3,14.4]})

In [9]:
my_df3

Unnamed: 0,date,element,value
0,2010-01-30,tmax,27.8
1,2010-01-30,tmin,14.5
2,2010-02-02,tmin,27.3
3,2010-02-02,tmin,14.4


* Here there are two `tmin` values for the same date, `2010-02-02`. We cannot use `pivot`, because pandas has no way of knowing how to resolve the duplicate values.
* Use `pivot_table` instead.
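
To see the failure itself, the sketch below calls `pivot` on a reduced, invented version of the duplicated rows (dates are kept as strings for brevity) and then aggregates the same rows with `pivot_table`.

```python
import pandas as pd

dup = pd.DataFrame({'date': ['2010-02-02', '2010-02-02'],
                    'element': ['tmin', 'tmin'],
                    'value': [27.3, 14.4]})

try:
    dup.pivot(index='date', columns='element', values='value')
    message = None
except ValueError as err:
    message = str(err)   # pivot refuses duplicate index/column pairs

# pivot_table combines the duplicates with an aggregation function
table = dup.pivot_table(index='date', columns='element',
                        values='value', aggfunc='mean')
print(message)
print(table)
```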

In [10]:
my_df3.pivot_table(index='date', columns='element', values='value', aggfunc=np.mean)

element,tmax,tmin
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2010-01-30,27.8,14.5
2010-02-02,,20.85


In [11]:
my_df3.pivot_table(index = 'date', columns='element', values='value', aggfunc=np.min)

element,tmax,tmin
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2010-01-30,27.8,14.5
2010-02-02,,14.4


In [13]:
my_df3.pivot_table(index = 'date', columns='element', values='value', aggfunc=sum, margins=True)

element,tmax,tmin,All
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2010-01-30 00:00:00,27.8,14.5,42.3
2010-02-02 00:00:00,,41.7,41.7
All,27.8,56.2,84.0


* `aggfunc` tells pandas how to combine multiple values. The default is the mean.

* Problem: columns contain multiple bits of information.

In [48]:
my_df4 = pd.DataFrame({'country':['AD', 'AE', 'AF'],
                       'year':[2000,2000,2000],
                       'm014':[0,2,52],'m1524':[0,4,228]})

In [49]:
my_df4

Unnamed: 0,country,year,m014,m1524
0,AD,2000,0,0
1,AE,2000,2,4
2,AF,2000,52,228


* `m014` means male, age 0 to 14.

* Melt the data so that the sex and age codes are in a single column.

In [52]:
my_df4_melt = pd.melt(my_df4, id_vars=['country', 'year'])

In [53]:
my_df4_melt

Unnamed: 0,country,year,variable,value
0,AD,2000,m014,0
1,AE,2000,m014,2
2,AF,2000,m014,52
3,AD,2000,m1524,0
4,AE,2000,m1524,4
5,AF,2000,m1524,228


* Why do we want to separate them?
    - For some models we want to be able to use age and sex as independent predictors.

In [54]:
my_df4_melt['sex'] = my_df4_melt['variable'].str[0]

In [55]:
my_df4_melt

Unnamed: 0,country,year,variable,value,sex
0,AD,2000,m014,0,m
1,AE,2000,m014,2,m
2,AF,2000,m014,52,m
3,AD,2000,m1524,0,m
4,AE,2000,m1524,4,m
5,AF,2000,m1524,228,m
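
The age code can be pulled out the same way, by slicing everything after the first character. In this sketch the frame is a reduced, invented copy of the melted data and the column name `age_group` is an assumption.

```python
import pandas as pd

melted = pd.DataFrame({'variable': ['m014', 'm1524'], 'value': [52, 228]})

melted['sex'] = melted['variable'].str[0]          # first character
melted['age_group'] = melted['variable'].str[1:]   # the rest is the age code

print(melted)
```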


-------- 

In [57]:
my_df5 = pd.DataFrame({'name':['Danial', 'John', 'Jane'],
                     'treatment a': ['-', 12,24],
                     'treatment b': [42,31,27]})

In [58]:
my_df5

Unnamed: 0,name,treatment a,treatment b
0,Danial,-,42
1,John,12,31
2,Jane,24,27


In [59]:
my_df5.dtypes

name           object
treatment a    object
treatment b     int64
dtype: object

* Sometimes we want to convert a column from one type to another.

```
df['treatment b'] = df['treatment b'].astype(str)

df['sex'] = df['sex'].astype('category') # convert column to categorical variable
# This makes the DataFrame smaller in memory
```
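
The memory saving can be measured with `memory_usage(deep=True)`. A sketch on an invented column with only two distinct values repeated many times:

```python
import pandas as pd

sex = pd.Series(['male', 'female'] * 5000)

as_text = sex.memory_usage(deep=True)                         # bytes as strings
as_category = sex.astype('category').memory_usage(deep=True)  # bytes as category

print(as_text, as_category)
```

The saving depends on how few distinct values the column has; a column of mostly unique strings gains little.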

In [60]:
my_df5['treatment a'] = pd.to_numeric(my_df5['treatment a'], errors = 'coerce')
# Invalid values are set to NaN.

In [61]:
my_df5.dtypes

name            object
treatment a    float64
treatment b      int64
dtype: object

In [62]:
my_df5

Unnamed: 0,name,treatment a,treatment b
0,Danial,,42
1,John,12.0,31
2,Jane,24.0,27


-------------

In [63]:
import re

pattern = re.compile(r'^\$\d*\.\d{2}$') # raw string so the backslashes reach the regex engine unchanged

In [64]:
result = pattern.match('$17.49')

In [66]:
bool(result)

True
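
The same pattern can validate a whole column through `Series.str.match`. The sample amounts below are invented for the sketch.

```python
import pandas as pd

amounts = pd.Series(['$17.49', '17.49', '$5.5', '$0.99'])

# True only for a dollar sign, optional digits, a dot, and exactly two decimals
valid = amounts.str.match(r'^\$\d*\.\d{2}$')

print(amounts[~valid])  # rows that need cleaning
```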

#### Converting to categorical variable

In [18]:
df = pd.DataFrame(['A+','A','A-','B+','B','B-','C+','C','C-','D+','D'],
                  index=['excellent','excellent','excellent', 'good', 'good', 'good', 
                         'ok', 'ok', 'ok', 'poor', 'poor'])
df.rename(columns= {0:'Grades'}, inplace=True)

In [19]:
df

Unnamed: 0,Grades
excellent,A+
excellent,A
excellent,A-
good,B+
good,B
good,B-
ok,C+
ok,C
ok,C-
poor,D+
poor,D


In [24]:
df['Grades'].astype('category')

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
ok            C
ok           C-
poor         D+
poor          D
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, ..., C+, C-, D, D+]

In [25]:
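grade_order = ['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
# astype('category', categories=..., ordered=True) no longer works in current pandas; pass a CategoricalDtype instead
grades = df['Grades'].astype(pd.CategoricalDtype(categories=grade_order, ordered=True))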

In [26]:
grades

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
ok            C
ok           C-
poor         D+
poor          D
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

In [27]:
grades>'C'

excellent     True
excellent     True
excellent     True
good          True
good          True
good          True
ok            True
ok           False
ok           False
poor         False
poor         False
Name: Grades, dtype: bool
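
Besides comparisons, an ordered categorical supports `min`, `max` and sorting by category order instead of alphabetical order. A sketch with a shortened, invented grade scale:

```python
import pandas as pd

order = ['D', 'C', 'B', 'A']   # lowest to highest
marks = pd.Series(['B', 'A', 'D', 'C']).astype(
    pd.CategoricalDtype(categories=order, ordered=True))

lowest, highest = marks.min(), marks.max()
ranked = marks.sort_values()   # follows the category order
at_least_b = marks >= 'B'

print(lowest, highest)
print(ranked.tolist())
```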