


*   Data Preparation
*   Data Transformation
*   Data Aggregation



In [0]:
import numpy as np
import pandas as pd

#Data Preparation

##Merging
Merging combines data by connecting rows through one or more keys.

Use the **merge()** function to combine tables.

In [2]:
frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62]})
frame1

Unnamed: 0,id,price
0,ball,12.33
1,pencil,11.44
2,pen,33.21
3,mug,13.23
4,ashtray,33.62


In [3]:
frame2 = pd.DataFrame( {'id':['pencil','pencil','ball','pen'],
                        'color': ['white','red','red','black']})
frame2

Unnamed: 0,color,id
0,white,pencil
1,red,pencil
2,red,ball
3,black,pen


In [4]:
pd.merge(frame1, frame2)

Unnamed: 0,id,price,color
0,ball,12.33,red
1,pencil,11.44,white
2,pencil,11.44,red
3,pen,33.21,black


The returned dataframe consists of all rows that have an ID in common. *mug* and *ashtray* are dropped.

In most cases, you need to decide which column to base the merge on. To do this, pass the **on** option with the name of the column (or columns) to use as the key. See the example below.

In [5]:
frame1 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'color': ['white','red','red','black','green'],
                        'brand': ['OMG','ABC','ABC','POD','POD']})
frame1

Unnamed: 0,brand,color,id
0,OMG,white,ball
1,ABC,red,pencil
2,ABC,red,pen
3,POD,black,mug
4,POD,green,ashtray


In [6]:
frame2 = pd.DataFrame( {'id':['pencil','pencil','ball','pen'],
                        'brand': ['OMG','POD','ABC','POD']})
frame2

Unnamed: 0,brand,id
0,OMG,pencil
1,POD,pencil
2,ABC,ball
3,POD,pen


In this case, frame1 and frame2 share two column names (*id* and *brand*). By default, **merge()** uses every shared column as a key, and no row matches on both, so the result is an empty dataframe.

In [7]:
pd.merge(frame1, frame2)

Unnamed: 0,brand,color,id


So you need to explicitly define the criteria for merging

In [8]:
pd.merge(frame1, frame2, on='id')

Unnamed: 0,brand_x,color,id,brand_y
0,OMG,white,ball,ABC
1,ABC,red,pencil,OMG
2,ABC,red,pencil,POD
3,ABC,red,pen,POD


In [9]:
pd.merge(frame1, frame2, on='brand')

Unnamed: 0,brand,color,id_x,id_y
0,OMG,white,ball,pencil
1,ABC,red,pencil,ball
2,ABC,red,pen,ball
3,POD,black,mug,pencil
4,POD,black,mug,pen
5,POD,green,ashtray,pencil
6,POD,green,ashtray,pen
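
As a quick check of the three merges above, the following sketch rebuilds the two frames under the illustrative names `left` and `right` (not names used in this notebook) and counts the rows that each choice of key produces. It assumes a recent pandas version.

```python
import pandas as pd

# Same data as frame1 and frame2 above; the names left/right are illustrative.
left = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                     'color': ['white', 'red', 'red', 'black', 'green'],
                     'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD']})
right = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                      'brand': ['OMG', 'POD', 'ABC', 'POD']})

# With no `on` option, merge() uses every column name the two frames share.
shared_columns = sorted(set(left.columns) & set(right.columns))

rows_default = len(pd.merge(left, right))               # keys: id AND brand
rows_on_id = len(pd.merge(left, right, on='id'))        # key: id only
rows_on_brand = len(pd.merge(left, right, on='brand'))  # key: brand only

print(shared_columns, rows_default, rows_on_id, rows_on_brand)
```

The default merge is empty because no (id, brand) pair appears in both frames, while each single key finds matches.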


To merge two dataframes whose key columns have different names, use the **left_on** and **right_on** options to specify the key column for the first and for the second dataframe, respectively.

In [13]:
frame2 = frame2.rename(columns={'id': 'sid'})  # rename by label, so the result does not depend on column order
frame2

Unnamed: 0,brand,sid
0,OMG,pencil
1,POD,pencil
2,ABC,ball
3,POD,pen


In [14]:
pd.merge(frame1, frame2, left_on= 'id', right_on='sid')

Unnamed: 0,brand_x,color,id,brand_y,sid
0,OMG,white,ball,ABC,ball
1,ABC,red,pencil,OMG,pencil
2,ABC,red,pencil,POD,pencil
3,ABC,red,pen,POD,pen


By default, the **merge()** function performs an *inner join*. Other possible options are the *left join*, the *right join*, and the *outer join*.

To select the type of join, use the **how** option.

In [0]:
frame2 = frame2.rename(columns={'sid': 'id'})  # restore the original column name

The default *inner join*,

In [16]:
pd.merge(frame1, frame2, on='id')

Unnamed: 0,brand_x,color,id,brand_y
0,OMG,white,ball,ABC
1,ABC,red,pencil,OMG
2,ABC,red,pencil,POD
3,ABC,red,pen,POD


The *outer join*,

In [17]:
pd.merge(frame1, frame2, on='id', how='outer')

Unnamed: 0,brand_x,color,id,brand_y
0,OMG,white,ball,ABC
1,ABC,red,pencil,OMG
2,ABC,red,pencil,POD
3,ABC,red,pen,POD
4,POD,black,mug,
5,POD,green,ashtray,


The *left join*,

In [18]:
pd.merge(frame1, frame2, on='id', how='left')

Unnamed: 0,brand_x,color,id,brand_y
0,OMG,white,ball,ABC
1,ABC,red,pencil,OMG
2,ABC,red,pencil,POD
3,ABC,red,pen,POD
4,POD,black,mug,
5,POD,green,ashtray,


The *right join*,

In [19]:
pd.merge(frame1, frame2, on='id', how='right')

Unnamed: 0,brand_x,color,id,brand_y
0,OMG,white,ball,ABC
1,ABC,red,pencil,OMG
2,ABC,red,pencil,POD
3,ABC,red,pen,POD
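
To compare the four join types side by side, the sketch below counts the rows returned by each and uses the `indicator=True` option of **merge()** to show where each row of an outer join came from. The frame names `left` and `right` are illustrative stand-ins for frame1 and frame2.

```python
import pandas as pd

# Same data as frame1 and frame2 above; the names left/right are illustrative.
left = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                     'color': ['white', 'red', 'red', 'black', 'green'],
                     'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD']})
right = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                      'brand': ['OMG', 'POD', 'ABC', 'POD']})

# Number of rows produced by each join type on the 'id' key.
row_counts = {how: len(pd.merge(left, right, on='id', how=how))
              for how in ['inner', 'outer', 'left', 'right']}

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
audit = pd.merge(left, right, on='id', how='outer', indicator=True)
left_only_ids = sorted(audit.loc[audit['_merge'] == 'left_only', 'id'])

print(row_counts)
print(left_only_ids)
```

Here the left and outer joins agree because every id in the second frame also exists in the first; only *mug* and *ashtray* lack a partner.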


To merge on multiple keys,

In [20]:
pd.merge(frame1, frame2, on=['id', 'brand'], how='outer')

Unnamed: 0,brand,color,id
0,OMG,white,ball
1,ABC,red,pencil
2,ABC,red,pen
3,POD,black,mug
4,POD,green,ashtray
5,OMG,,pencil
6,POD,,pencil
7,ABC,,ball
8,POD,,pen


###Merging on an Index
The previous examples used columns as merge keys. You can also use the indexes as keys by setting the **left_index** or **right_index** option to **True**.

In [21]:
pd.merge(frame1, frame2, right_index=True, left_index=True)

Unnamed: 0,brand_x,color,id_x,brand_y,id_y
0,OMG,white,ball,OMG,pencil
1,ABC,red,pencil,POD,pencil
2,ABC,red,pen,ABC,ball
3,POD,black,mug,POD,pen


The **join()** function is more convenient for merging on indexes. However, the two dataframes must not share any column names.

In [0]:
frame1.join(frame2)  # raises ValueError: the 'brand' and 'id' columns overlap and no suffix is specified

In [0]:
frame2 = frame2.rename(columns={'brand': 'brand2', 'id': 'id2'})

In [25]:
frame1.join(frame2)

Unnamed: 0,brand,color,id,brand2,id2
0,OMG,white,ball,OMG,pencil
1,ABC,red,pencil,POD,pencil
2,ABC,red,pen,ABC,ball
3,POD,black,mug,POD,pen
4,POD,green,ashtray,,
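
As an alternative to renaming the columns by hand, **join()** accepts the `lsuffix` and `rsuffix` options, which append a suffix to overlapping column names. The sketch below is a minimal illustration; the frame names and the suffix strings are arbitrary choices, not part of the original example.

```python
import pandas as pd

# Same data as frame1 and frame2 above; the names left/right are illustrative.
left = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                     'color': ['white', 'red', 'red', 'black', 'green'],
                     'brand': ['OMG', 'ABC', 'ABC', 'POD', 'POD']})
right = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                      'brand': ['OMG', 'POD', 'ABC', 'POD']})

# Overlapping names ('id', 'brand') get a suffix instead of raising an error.
joined = left.join(right, lsuffix='_left', rsuffix='_right')

print(joined.columns.tolist())
print(joined)
```

Because **join()** performs a left join on the index by default, the fifth row of `left` has no partner and its right-hand columns are NaN.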


##Concatenating

**concatenate()** function from NumPy

In [26]:
array1 = np.arange(9).reshape(3,3)
array1

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [27]:
array2 = np.arange(9).reshape(3,3) + 6
array2

array([[ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [29]:
np.concatenate([array1, array2], axis=1)  # axis=1 joins the arrays side by side, adding columns

array([[ 0,  1,  2,  6,  7,  8],
       [ 3,  4,  5,  9, 10, 11],
       [ 6,  7,  8, 12, 13, 14]])

In [30]:
np.concatenate([array1, array2], axis=0)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

**concat()** function from Pandas

In [31]:
ser1 = pd.Series(np.random.rand(4), index=[1,2,3,4])
ser1

1    0.263465
2    0.668358
3    0.728427
4    0.134150
dtype: float64

In [32]:
ser2 = pd.Series(np.random.rand(4), index=[5,6,7,8])
ser2

5    0.944954
6    0.493830
7    0.859512
8    0.483135
dtype: float64

In [33]:
pd.concat([ser1, ser2])

1    0.263465
2    0.668358
3    0.728427
4    0.134150
5    0.944954
6    0.493830
7    0.859512
8    0.483135
dtype: float64

By default, **concat()** works along **axis=0** and returns a series. With **axis=1**, the result is a dataframe, as shown below.

In [34]:
pd.concat([ser1, ser2], axis=1)

Unnamed: 0,0,1
1,0.263465,
2,0.668358,
3,0.728427,
4,0.13415,
5,,0.944954
6,,0.49383
7,,0.859512
8,,0.483135


If you want to create a hierarchical index on the axis of concatenation, use the **keys** option

In [35]:
pd.concat([ser1, ser2], keys=[1,2])

1  1    0.263465
   2    0.668358
   3    0.728427
   4    0.134150
2  5    0.944954
   6    0.493830
   7    0.859512
   8    0.483135
dtype: float64

If **axis=1** in this case, the **keys** become the column headers of the dataframe

In [36]:
pd.concat([ser1, ser2], keys=[1,2], axis=1)

Unnamed: 0,1,2
1,0.263465,
2,0.668358,
3,0.728427,
4,0.13415,
5,,0.944954
6,,0.49383
7,,0.859512
8,,0.483135


The same applies to dataframes.

In [37]:
frame1 = pd.DataFrame(np.random.rand(9).reshape(3,3), index=[1,2,3],
                      columns=['A','B','C'])
frame2 = pd.DataFrame(np.random.rand(9).reshape(3,3), index=[4,5,6],
                      columns=['A','B','C'])
pd.concat([frame1, frame2])

Unnamed: 0,A,B,C
1,0.392675,0.146702,0.421475
2,0.22465,0.260781,0.131519
3,0.305308,0.279102,0.074014
4,0.310741,0.545343,0.416978
5,0.522835,0.74203,0.433634
6,0.264633,0.125348,0.918616


In [38]:
pd.concat([frame1, frame2], axis=1)

Unnamed: 0,A,B,C,A.1,B.1,C.1
1,0.392675,0.146702,0.421475,,,
2,0.22465,0.260781,0.131519,,,
3,0.305308,0.279102,0.074014,,,
4,,,,0.310741,0.545343,0.416978
5,,,,0.522835,0.74203,0.433634
6,,,,0.264633,0.125348,0.918616
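
When the indexes of the two dataframes only partly overlap, the `join` option of **concat()** controls which rows are kept. The sketch below uses small deterministic frames (the names `left` and `right` and the key labels are illustrative) to contrast the default outer behaviour with `join='inner'`.

```python
import numpy as np
import pandas as pd

# Two frames whose indexes overlap on the labels 2 and 3.
left = pd.DataFrame(np.arange(9).reshape(3, 3),
                    index=[1, 2, 3], columns=['A', 'B', 'C'])
right = pd.DataFrame(np.arange(9, 18).reshape(3, 3),
                     index=[2, 3, 4], columns=['A', 'B', 'C'])

# Default (outer): keep every index label, filling the gaps with NaN.
both = pd.concat([left, right], axis=1, keys=['left', 'right'])

# join='inner': keep only the index labels present in both frames.
shared = pd.concat([left, right], axis=1, keys=['left', 'right'], join='inner')

print(both)
print(shared)
```

The `keys` option also removes the ambiguity of the repeated column names, because each original frame gets its own top-level column label.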


##Combining
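Sometimes two datasets have indexes that overlap, fully or partially, and you want to fill the gaps in one with values from the other. Neither **merge()** nor **concat()** handles this case directly.

The **combine_first()** function performs this combination, aligning the data on the indexes.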

In [39]:
ser1 = pd.Series(np.random.rand(5),index=[1,2,3,4,5])
ser1

1    0.163401
2    0.600516
3    0.191775
4    0.175329
5    0.113567
dtype: float64

In [40]:
ser2 = pd.Series(np.random.rand(4),index=[2,4,5,6])
ser2

2    0.752758
4    0.047720
5    0.639549
6    0.477362
dtype: float64

In [41]:
ser1.combine_first(ser2)

1    0.163401
2    0.600516
3    0.191775
4    0.175329
5    0.113567
6    0.477362
dtype: float64

In [42]:
ser2.combine_first(ser1)

1    0.163401
2    0.752758
3    0.191775
4    0.047720
5    0.639549
6    0.477362
dtype: float64

**combine_first()** aligns the two datasets on their indexes. Where both have a value for the same index, the value from the calling object takes precedence.

For partial overlap,

In [43]:
ser1[:3].combine_first(ser2[:3])

1    0.163401
2    0.600516
3    0.191775
4    0.047720
5    0.639549
dtype: float64
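
Because the examples above use random values, the precedence rule can be hard to see. The following sketch uses fixed numbers and hypothetical series names (`primary`, `backup`) to show that a missing value in the calling series is also filled from the other series.

```python
import numpy as np
import pandas as pd

# 'primary' has a gap at 'b' and no entry for 'd'.
primary = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
backup = pd.Series([10.0, 20.0, 40.0], index=['a', 'b', 'd'])

# Values from 'primary' win; NaN and absent labels are taken from 'backup'.
patched = primary.combine_first(backup)

print(patched)
```

The result covers the union of the two indexes: `a` and `c` come from `primary`, while `b` and `d` come from `backup`.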

##Pivoting
Pivoting is used to unify the values collected from different sources


###Pivoting with Hierarchical Indexing


*   Stacking -- Rotates or pivots the data structure converting columns to rows
*   Unstacking -- Converts rows into columns


In [44]:
frame1 = pd.DataFrame(np.arange(9).reshape(3,3),
                      index=['white','black','red'],
                      columns=['ball','pen','pencil'])
frame1

Unnamed: 0,ball,pen,pencil
white,0,1,2
black,3,4,5
red,6,7,8


**stack()** on dataframe

In [46]:
ser5 = frame1.stack()
ser5

white  ball      0
       pen       1
       pencil    2
black  ball      3
       pen       4
       pencil    5
red    ball      6
       pen       7
       pencil    8
dtype: int64

**unstack()** on hierarchically indexed series

In [47]:
ser5.unstack()

Unnamed: 0,ball,pen,pencil
white,0,1,2
black,3,4,5
red,6,7,8


In [48]:
ser5.unstack(0)

Unnamed: 0,white,black,red
ball,0,3,6
pen,1,4,7
pencil,2,5,8


###Pivoting from "Long" to "Wide" Format

In [49]:
longframe = pd.DataFrame({ 'color':['white','white','white',
                                    'red','red','red',
                                    'black','black','black'],
                          'item':['ball','pen','mug',
                                  'ball','pen','mug',
                                  'ball','pen','mug'],
                          'value': np.random.rand(9)})
longframe

Unnamed: 0,color,item,value
0,white,ball,0.028005
1,white,pen,0.030776
2,white,mug,0.78456
3,red,ball,0.267798
4,red,pen,0.624344
5,red,mug,0.080814
6,black,ball,0.611194
7,black,pen,0.457765
8,black,mug,0.486705


The **pivot()** function transforms the long dataframe into a more readable wide one. You specify which column supplies the row index and which supplies the column labels.

In [50]:
wideframe = longframe.pivot(index='color', columns='item')
wideframe

,value,value,value
item,ball,mug,pen
color,,,
black,0.611194,0.486705,0.457765
red,0.267798,0.080814,0.624344
white,0.028005,0.78456,0.030776
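
Two practical points about **pivot()** are sketched below with a small made-up dataset (the frame names are illustrative). Passing `values` avoids the extra column level seen above, and a repeated (index, column) pair makes **pivot()** fail, in which case **pivot_table()** with an aggregation function is one possible alternative.

```python
import pandas as pd

long = pd.DataFrame({'color': ['white', 'white', 'red', 'red'],
                     'item': ['ball', 'pen', 'ball', 'pen'],
                     'value': [1.0, 2.0, 3.0, 4.0]})

# values='value' yields plain column labels instead of a two-level header.
wide = long.pivot(index='color', columns='item', values='value')

# Add a second ('white', 'ball') row so that the pair is no longer unique.
extra = pd.DataFrame({'color': ['white'], 'item': ['ball'], 'value': [3.0]})
dup = pd.concat([long, extra], ignore_index=True)

try:
    dup.pivot(index='color', columns='item', values='value')
    pivot_failed = False
except ValueError:
    # pivot() cannot decide which of the duplicate values to keep.
    pivot_failed = True

# pivot_table() resolves duplicates by aggregating them, here with the mean.
averaged = dup.pivot_table(index='color', columns='item',
                           values='value', aggfunc='mean')

print(wide)
print(averaged)
```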


##Removing

In [54]:
frame1 = pd.DataFrame(np.arange(9).reshape(3,3),
                      index=['white','black','red'],
                      columns=['ball','pen','pencil'])
frame1

Unnamed: 0,ball,pen,pencil
white,0,1,2
black,3,4,5
red,6,7,8


**del** command to remove a column

In [55]:
del frame1['ball']
frame1

Unnamed: 0,pen,pencil
white,1,2
black,4,5
red,7,8


**drop()** function to remove a row

In [56]:
frame1.drop('white')

Unnamed: 0,pen,pencil
black,4,5
red,7,8


#Data Transformation

##Removing Duplicates

In [57]:
dframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
                       'value': [2,1,3,3,2]})
dframe

Unnamed: 0,color,value
0,white,2
1,white,1
2,red,3
3,red,3
4,white,2


**duplicated()** detects duplicate rows and returns a series of Booleans: **True** for a row that repeats an earlier row, and **False** for a row with no earlier duplicate

In [58]:
dframe.duplicated()

0    False
1    False
2    False
3     True
4     True
dtype: bool

To see which elements are duplicated:

In [59]:
dframe[dframe.duplicated()]

Unnamed: 0,color,value
3,red,3
4,white,2


**drop_duplicates()** returns the dataframe without the duplicate rows

In [60]:
dframe.drop_duplicates()

Unnamed: 0,color,value
0,white,2
1,white,1
2,red,3
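
Both functions accept options that change what counts as a duplicate. The sketch below, which reuses the data of `dframe`, shows the `subset` option (compare only some columns) and the `keep` option (retain the last occurrence instead of the first).

```python
import pandas as pd

dframe = pd.DataFrame({'color': ['white', 'white', 'red', 'red', 'white'],
                       'value': [2, 1, 3, 3, 2]})

# Compare rows on 'color' only: the first row of each color survives.
first_per_color = dframe.drop_duplicates(subset='color')

# Compare whole rows, but keep the last occurrence of each duplicate.
keep_last = dframe.drop_duplicates(keep='last')

print(first_per_color)
print(keep_last)
```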


##Mapping
A *dict* object is the natural way to define a mapping.

    mapping = {
             'label1': 'value1',
             'label2': 'value2',
             ...
    }



*   **replace()** -- Replaces values
*   **map()** -- Creates a new column
*   **rename()** -- Replaces the index values



###Replacing Values via Mapping

In [61]:
frame = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                      'color':['white','rosso','verde','black','yellow'],
                      'price':[5.56,4.20,1.30,0.56,2.75]})
frame

Unnamed: 0,color,item,price
0,white,ball,5.56
1,rosso,mug,4.2
2,verde,pen,1.3
3,black,pencil,0.56
4,yellow,ashtray,2.75


Define a mapping:

In [0]:
newcolors = {
    'rosso': 'red',
    'verde': 'green'
}

Use **replace()** with the mapping

In [63]:
frame.replace(newcolors)

Unnamed: 0,color,item,price
0,white,ball,5.56
1,red,mug,4.2
2,green,pen,1.3
3,black,pencil,0.56
4,yellow,ashtray,2.75


Replace NaN with 0,

In [64]:
ser = pd.Series([1,3,np.nan,4,6,np.nan,3])
ser

0    1.0
1    3.0
2    NaN
3    4.0
4    6.0
5    NaN
6    3.0
dtype: float64

In [65]:
ser.replace(np.nan, 0)

0    1.0
1    3.0
2    0.0
3    4.0
4    6.0
5    0.0
6    3.0
dtype: float64

###Adding Values via Mapping

In [66]:
frame = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                      'color':['white','red','green','black','yellow']})
frame

Unnamed: 0,color,item
0,white,ball
1,red,mug
2,green,pen
3,black,pencil
4,yellow,ashtray


Define a price mapping,

In [0]:
prices = {
    'ball' : 5.56,
    'mug' : 4.20,
    'bottle' : 1.30,
    'scissors' : 3.41,
    'pen' : 1.30,
    'pencil' : 0.56,
    'ashtray' : 2.75
}

**map()** applies to a series or to a column of a dataframe.

In this case, you can apply the price mapping to the *item* column and store the result as a new *price* column in the dataframe

In [69]:
frame['price'] = frame['item'].map(prices)  # map each item to its price
frame

Unnamed: 0,color,item,price
0,white,ball,5.56
1,red,mug,4.2
2,green,pen,1.3
3,black,pencil,0.56
4,yellow,ashtray,2.75
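
In the example above every item has an entry in the mapping. When a key is missing, **map()** produces NaN for that row. The sketch below illustrates this with a made-up item (`'stapler'`) and shows one possible way to fill the gap with **fillna()**; the default of 0.0 is an arbitrary choice.

```python
import pandas as pd

frame = pd.DataFrame({'item': ['ball', 'mug', 'stapler']})
prices = {'ball': 5.56, 'mug': 4.20}   # no entry for 'stapler'

# Items without a dict entry are mapped to NaN.
frame['price'] = frame['item'].map(prices)
missing = frame.loc[frame['price'].isna(), 'item'].tolist()

# One option: replace the missing prices with a default value.
frame['price_filled'] = frame['price'].fillna(0.0)

print(missing)
print(frame)
```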


###Rename the Indexes of the Axes

Create a map first

In [0]:
reindex = {
    0: 'first',
    1: 'second',
    2: 'third',
    3: 'fourth',
    4: 'fifth'
}

The **rename()** function by default will rename the indexes

In [71]:
frame

Unnamed: 0,color,item,price
0,white,ball,5.56
1,red,mug,4.2
2,green,pen,1.3
3,black,pencil,0.56
4,yellow,ashtray,2.75


In [72]:
frame.rename(reindex)

Unnamed: 0,color,item,price
first,white,ball,5.56
second,red,mug,4.2
third,green,pen,1.3
fourth,black,pencil,0.56
fifth,yellow,ashtray,2.75


To rename columns, use the **columns** option in **rename()**

In [0]:
recolumn = {
    'item': 'object',
    'price': 'value'
}

In [74]:
frame.rename(index=reindex, columns=recolumn)

Unnamed: 0,color,object,value
first,white,ball,5.56
second,red,mug,4.2
third,green,pen,1.3
fourth,black,pencil,0.56
fifth,yellow,ashtray,2.75


To replace just one value,

In [75]:
frame.rename(index={1:'first'}, columns={'item':'object'})

Unnamed: 0,color,object,price
0,white,ball,5.56
first,red,mug,4.2
2,green,pen,1.3
3,black,pencil,0.56
4,yellow,ashtray,2.75


To modify the object on which you call the function, rather than return a renamed copy, set the **inplace** option to **True**

In [77]:
frame.rename(columns={'item':'object'}, inplace=True)
frame

Unnamed: 0,color,object,price
0,white,ball,5.56
1,red,mug,4.2
2,green,pen,1.3
3,black,pencil,0.56
4,yellow,ashtray,2.75


##Discretization and Binning

In [0]:
results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]

Assume the values range from 0 to 100 and are to be divided into four equal parts (bins).

In [0]:
bins = [0,25,50,75,100]

The **cut()** function returns a special object of *Categorical* type. It contains a **categories** array with the names of the bins and a **codes** array with one number per element of **results**, giving the bin that element falls into

In [4]:
cat = pd.cut(results, bins)
cat

[(0, 25], (25, 50], (50, 75], (50, 75], (25, 50], ..., (75, 100], (0, 25], (25, 50], (75, 100], (75, 100]]
Length: 17
Categories (4, interval[int64]): [(0, 25] < (25, 50] < (50, 75] < (75, 100]]

In [5]:
cat.categories

IntervalIndex([(0, 25], (25, 50], (50, 75], (75, 100]]
              closed='right',
              dtype='interval[int64]')

In [6]:
cat.codes

array([0, 1, 2, 2, 1, 3, 3, 0, 0, 2, 2, 1, 3, 0, 1, 3, 3], dtype=int8)

**value_counts()** -- the occurrences for each bin

In [7]:
pd.Series(cat).value_counts()  # the top-level pd.value_counts() is deprecated in recent pandas

(75, 100]    5
(50, 75]     4
(25, 50]     4
(0, 25]      4
dtype: int64

To give names to various bins, use the **labels** option inside the **cut()**

In [9]:
bin_names = ['unlikely','less likely','likely','highly likely']
pd.cut(results, bins, labels=bin_names)

[unlikely, less likely, likely, likely, less likely, ..., highly likely, unlikely, less likely, highly likely, highly likely]
Length: 17
Categories (4, object): [unlikely < less likely < likely < highly likely]

If you pass an integer to **cut()** instead of explicit bin edges, as in cut(results, 5), the range of values in the array is divided into that many intervals of equal width.

In [10]:
pd.cut(results,5)

[(2.904, 22.2], (22.2, 41.4], (60.6, 79.8], (41.4, 60.6], (22.2, 41.4], ..., (79.8, 99.0], (22.2, 41.4], (41.4, 60.6], (79.8, 99.0], (79.8, 99.0]]
Length: 17
Categories (5, interval[float64]): [(2.904, 22.2] < (22.2, 41.4] < (41.4, 60.6] < (60.6, 79.8] <
                                    (79.8, 99.0]]

**qcut()** divides the sample by quantiles; with 5 bins it splits directly into quintiles. **qcut()** makes the number of occurrences in each bin as equal as possible, so the edges of the bins vary

In [15]:
pd.qcut(results, 5)

[(2.999, 24.0], (24.0, 46.0], (62.6, 87.0], (46.0, 62.6], (24.0, 46.0], ..., (62.6, 87.0], (2.999, 24.0], (46.0, 62.6], (87.0, 99.0], (62.6, 87.0]]
Length: 17
Categories (5, interval[float64]): [(2.999, 24.0] < (24.0, 46.0] < (46.0, 62.6] < (62.6, 87.0] <
                                    (87.0, 99.0]]

In [16]:
pd.Series(pd.qcut(results, 5)).value_counts()

(62.6, 87.0]     4
(2.999, 24.0]    4
(87.0, 99.0]     3
(46.0, 62.6]     3
(24.0, 46.0]     3
dtype: int64
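
The bins produced by **cut()** are open on the left and closed on the right by default, so a value equal to the lowest edge falls outside every bin. The sketch below shows this with three hypothetical scores and two options that change the behaviour, `include_lowest` and `right`. A code of -1 marks a value that fits no bin.

```python
import pandas as pd

bins = [0, 25, 50, 75, 100]
scores = [0, 25, 100]   # illustrative values sitting exactly on bin edges

# Default: intervals are (0, 25], (25, 50], ... so 0 belongs to no bin.
default_codes = pd.cut(scores, bins).codes.tolist()

# include_lowest=True closes the first interval on the left as well.
lowest_codes = pd.cut(scores, bins, include_lowest=True).codes.tolist()

# right=False flips the intervals to [0, 25), [25, 50), ... so 100 is left out.
left_closed_codes = pd.cut(scores, bins, right=False).codes.tolist()

print(default_codes, lowest_codes, left_closed_codes)
```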

###Detecting and Filtering Outliers

Create a 1000-by-3 dataframe of random values drawn from the standard normal distribution

In [17]:
randframe = pd.DataFrame(np.random.randn(1000,3))
randframe.describe()

Unnamed: 0,0,1,2
count,1000.0,1000.0,1000.0
mean,0.016707,0.044275,-0.002577
std,1.012068,1.019511,1.021862
min,-3.819105,-2.975727,-3.126504
25%,-0.678127,-0.687734,-0.687277
50%,0.019106,0.03752,-0.016273
75%,0.684618,0.7611,0.667789
max,3.46331,3.124706,3.523281


In [18]:
randframe.std()

0    1.012068
1    1.019511
2    1.021862
dtype: float64

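The **any()** function returns whether any element is True over the requested axis. Here it selects the rows in which at least one value lies more than three standard deviations from zero.

In [20]:
criteria = np.abs(randframe) > (3*randframe.std())
randframe[criteria.any(axis=1)]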

Unnamed: 0,0,1,2
63,-1.004557,-0.456946,3.523281
124,3.46331,-1.330892,0.847611
300,-3.819105,-0.312284,-0.023131
434,3.229023,0.677766,-1.483561
477,-3.057823,1.454993,-0.133305
757,-3.408064,-0.010713,2.006312
912,1.456711,3.124706,-0.421396
933,1.538014,1.234787,-3.126504
978,1.646848,-0.352369,3.277556
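
The cell above only detects the outliers. One possible way to filter them, sketched below, is to cap each outlying value at three standard deviations while keeping its sign. The seeded generator and the names `data`, `limit` and `capped` are assumptions made so that the example is reproducible; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)            # fixed seed for reproducibility
data = pd.DataFrame(rng.standard_normal((1000, 3)))

limit = 3 * data.std()                    # one threshold per column
is_outlier = data.abs() > limit           # Boolean frame, aligned on columns
outlier_rows = data[is_outlier.any(axis=1)]

# Replace each outlier with +limit or -limit, depending on its sign.
capped = data.mask(is_outlier, np.sign(data) * limit)

print(len(outlier_rows))
print(capped.abs().max())
```

After capping, no absolute value in a column exceeds that column's threshold, and all other values are unchanged.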


###Permutation
Permutation is the random reordering of a series or of the rows of a dataframe

**numpy.random.permutation()**

In [26]:
nframe = pd.DataFrame(np.arange(49).reshape(7,7))
nframe

Unnamed: 0,0,1,2,3,4,5,6
0,0,1,2,3,4,5,6
1,7,8,9,10,11,12,13
2,14,15,16,17,18,19,20
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34
5,35,36,37,38,39,40,41
6,42,43,44,45,46,47,48


In [27]:
new_order = np.random.permutation(7)
new_order

array([5, 2, 0, 3, 6, 4, 1])

The **take()** function applies that order to all rows of the dataframe

In [28]:
nframe.take(new_order)

Unnamed: 0,0,1,2,3,4,5,6
5,35,36,37,38,39,40,41
2,14,15,16,17,18,19,20
0,0,1,2,3,4,5,6
3,21,22,23,24,25,26,27
6,42,43,44,45,46,47,48
4,28,29,30,31,32,33,34
1,7,8,9,10,11,12,13


To apply a permutation to only a portion of the dataframe,

In [29]:
new_order = [3,4,2]
nframe.take(new_order)

Unnamed: 0,0,1,2,3,4,5,6
3,21,22,23,24,25,26,27
4,28,29,30,31,32,33,34
2,14,15,16,17,18,19,20


###Random Sampling

In [35]:
len(nframe) # number of rows in dataframe

7

In [32]:
sample = np.random.randint(0,len(nframe), size=3)
sample  # random integers from 0 up to, but not including, len(nframe); values may repeat

array([3, 5, 4])

In [36]:
nframe.take(sample)

Unnamed: 0,0,1,2,3,4,5,6
3,21,22,23,24,25,26,27
5,35,36,37,38,39,40,41
4,28,29,30,31,32,33,34
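
Because **np.random.randint()** draws with replacement, the same row can be selected more than once. As a hedged alternative, pandas offers the **sample()** method, which by default draws rows without replacement. The `random_state` value below is an arbitrary seed chosen for reproducibility.

```python
import numpy as np
import pandas as pd

nframe = pd.DataFrame(np.arange(49).reshape(7, 7))

# Three distinct rows, drawn without replacement.
picked = nframe.sample(n=3, random_state=0)

# frac=1 returns every row in random order, i.e. a full permutation.
shuffled = nframe.sample(frac=1, random_state=0)

print(picked)
print(shuffled.index.tolist())
```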


##String Manipulation

###Built-in Methods for String Manipulation
**split()** function separates parts of the text using a separator

In [37]:
text = '16 Bolton Avenue , Boston'
text.split(',')

['16 Bolton Avenue ', ' Boston']

But the first element has a space character in the end.

You have to use the **split()** function along with the **strip()** function to trim the whitespace (including newlines)

In [40]:
tokens = [s.strip() for s in text.split(',')]
tokens

['16 Bolton Avenue', 'Boston']

You need a list comprehension here because **text.split()** returns a list, and a list has no **strip()** method

In [42]:
address, city = [s.strip() for s in text.split(',')]
address

'16 Bolton Avenue'

In [43]:
city

'Boston'

To concatenate strings,

In [44]:
address + ',' + city

'16 Bolton Avenue,Boston'

To join many strings, call the **join()** function on the separator string,

In [46]:
strings = ['A+','A','A-','B','BB','BBB','C+']
' ; '.join(strings)

'A+ ; A ; A- ; B ; BB ; BBB ; C+'

To search for a substring,

In [47]:
'Boston' in text

True

**index()** and **find()** return the position of the first character of the substring within the text

In [48]:
text.index('Boston')

19

In [50]:
text.find('Boston')

19

If the substring is not found, **index()** raises a **ValueError**, while **find()** returns -1:

In [52]:
text.index('New York')

ValueError: substring not found

In [54]:
text.find('New York')

-1

**count()** returns how many times a substring occurs within the text

In [55]:
text.count('e')

2

In [56]:
text.count('Avenue')

1

**replace()** function,

In [59]:
text

'16 Bolton Avenue , Boston'

In [57]:
text.replace('Avenue', 'Street')

'16 Bolton Street , Boston'

In [58]:
text.replace('1','')

'6 Bolton Avenue , Boston'

###Regular Expressions
*regex* - a single expression that describes a pattern to search for in text

**re** module for operations of the regular expressions

In [0]:
import re

The functions of the **re** module fall into three main categories:
*  Pattern matching
*  Substitution
*  Splitting

In [63]:
text = "This is       an\t odd   \n text!"
re.split(r'\s+', text)  # raw string, so the backslash reaches the regex engine unchanged

['This', 'is', 'an', 'odd', 'text!']

The **\s+** regex matches a sequence of one or more whitespace characters

**re.compile()** returns a reusable regex object

In [0]:
regex = re.compile(r'\s+')

In [65]:
regex.split(text)

['This', 'is', 'an', 'odd', 'text!']
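
The substitution category listed above works in the same way. As a minimal sketch, the compiled pattern can replace every run of whitespace with a single space through its **sub()** method; the variable name `normalized` is illustrative.

```python
import re

text = "This is       an\t odd   \n text!"
regex = re.compile(r'\s+')

# Replace each run of whitespace (spaces, tabs, newlines) with one space.
normalized = regex.sub(' ', text)

print(normalized)
```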

**findall()** matches a regex pattern against the whole text and returns a list of all the substrings that satisfy it.

For example, to find all the words starting with an uppercase "A", or with "a" regardless of whether it is upper- or lowercase:

In [66]:
text = 'This is my address: 16 Bolton Avenue, Boston'
re.findall(r'A\w+', text)

['Avenue']

In [67]:
re.findall(r'[Aa]\w+', text)  # [Aa] matches either letter; a comma inside the brackets would also match ','

['address', 'Avenue']

**search()** returns only the first match. The returned match object reports the start and end positions of the substring within the string.

In [69]:
re.search(r'[Aa]\w+', text)

<_sre.SRE_Match object; span=(11, 18), match='address'>

In [70]:
search = re.search(r'[Aa]\w+', text)
search.start()

11

In [71]:
search.end()

18

In [72]:
text[search.start() : search.end()]

'address'

**match()** only matches at the beginning of the string. If the pattern does not match there, it returns **None**, so the first cell below shows no output.

In [0]:
re.match(r'[Aa]\w+', text)

In [75]:
re.match(r'T\w+', text)

<_sre.SRE_Match object; span=(0, 4), match='This'>

In [76]:
match = re.match(r'T\w+', text)
text[match.start() : match.end()]

'This'
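
Slicing the text with **start()** and **end()** works, but a match object can also return the matched substring directly through its **group()** method. Since **search()** and **match()** return **None** when nothing matches, it is safer to test the result before using it. The sketch below shows both points; the variable names and the fallback string are illustrative.

```python
import re

text = 'This is my address: 16 Bolton Avenue, Boston'

# group() returns the matched substring without manual slicing.
found = re.search(r'[Aa]\w+', text)
first_word = found.group()

# match() only looks at the start of the string, so this returns None.
at_start = re.match(r'[Aa]\w+', text)
result = at_start.group() if at_start else 'no match at start'

print(first_word)
print(result)
```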

#Data Aggregation

##GroupBy
*split-apply-combine*
*  Splitting -- Division into groups of datasets
*  Applying -- Application of a function on each group
*  Combining -- Combination of all the results obtained by different groups

In [77]:
frame = pd.DataFrame({ 'color': ['white','red','green','red','green'],
                      'object': ['pen','pencil','pencil','ashtray','pen'],
                      'price1' : [5.56,4.20,1.30,0.56,2.75],
                      'price2' : [4.75,4.12,1.60,0.75,3.15]})
frame

Unnamed: 0,color,object,price1,price2
0,white,pen,5.56,4.75
1,red,pencil,4.2,4.12
2,green,pencil,1.3,1.6
3,red,ashtray,0.56,0.75
4,green,pen,2.75,3.15


Suppose you want to calculate the average of the price1 column using group labels listed in the color column.

Access the *price1* column and call the **groupby()** function with the *color* column

In [78]:
group = frame['price1'].groupby(frame['color'])
group

<pandas.core.groupby.SeriesGroupBy object at 0x7f3bc6c9f860>

In [79]:
group.groups

{'green': Int64Index([2, 4], dtype='int64'),
 'red': Int64Index([1, 3], dtype='int64'),
 'white': Int64Index([0], dtype='int64')}

In [80]:
group.mean()

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

In [81]:
group.sum()

color
green    4.05
red      4.76
white    5.56
Name: price1, dtype: float64

###Hierarchical Grouping
Grouping by multiple keys produces a hierarchical grouping

In [83]:
ggroup = frame['price1'].groupby([frame['color'], frame['object']])
#need to be in a list inside the groupby()
ggroup.groups

{('green', 'pen'): Int64Index([4], dtype='int64'),
 ('green', 'pencil'): Int64Index([2], dtype='int64'),
 ('red', 'ashtray'): Int64Index([3], dtype='int64'),
 ('red', 'pencil'): Int64Index([1], dtype='int64'),
 ('white', 'pen'): Int64Index([0], dtype='int64')}

In [84]:
ggroup.sum()

color  object 
green  pen        2.75
       pencil     1.30
red    ashtray    0.56
       pencil     4.20
white  pen        5.56
Name: price1, dtype: float64

Apply the grouping to multiple columns

In [85]:
frame[['price1','price2']].groupby(frame['color']).mean()

color,price1,price2
green,2.025,2.375
red,2.38,2.435
white,5.56,4.75


In [86]:
frame.groupby(frame['color']).mean(numeric_only=True)  # skip the text column 'object'

color,price1,price2
green,2.025,2.375
red,2.38,2.435
white,5.56,4.75


##Group Iteration
The **GroupBy** object supports iteration: it generates a sequence of **two-tuples**, each containing the **name** of the group together with its **data** portion.

In [87]:
for name, group in frame.groupby('color'):
  print(name)
  print(group)

green
   color  object  price1  price2
2  green  pencil    1.30    1.60
4  green     pen    2.75    3.15
red
  color   object  price1  price2
1   red   pencil    4.20    4.12
3   red  ashtray    0.56    0.75
white
   color object  price1  price2
0  white    pen    5.56    4.75
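
When only one group is of interest, iterating over all of them is unnecessary. As a small sketch, the **get_group()** method of the GroupBy object returns the rows of a single group, and **size()** reports how many rows each group holds.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                      'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                      'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
                      'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})

grouped = frame.groupby('color')

red_rows = grouped.get_group('red')   # only the rows whose color is 'red'
sizes = grouped.size()                # number of rows in each group

print(red_rows)
print(sizes)
```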


##Chain of Transformation

In [88]:
result1 = frame['price1'].groupby(frame['color']).mean()
type(result1)

pandas.core.series.Series

In [89]:
result2 = frame.groupby(frame['color']).mean(numeric_only=True)
type(result2)

pandas.core.frame.DataFrame

Select a single column at any point in the various phases:

In [90]:
frame['price1'].groupby(frame['color']).mean()

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

In [91]:
frame.groupby(frame['color'])['price1'].mean()

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

In [92]:
(frame.groupby(frame['color']).mean(numeric_only=True))['price1']

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

After aggregation (**sum()** or **mean()**, etc), you may want to add a prefix to the column name that describes the type of aggregation operation to keep track of the source data.

Use the **add_prefix()** function

In [94]:
means = frame.groupby('color').mean(numeric_only=True).add_prefix('mean_')
means  # passing the column name 'color' is enough; frame['color'] is not needed

color,mean_price1,mean_price2
green,2.025,2.375
red,2.38,2.435
white,5.56,4.75


##Functions on Groups

In [95]:
group = frame.groupby('color')
group['price1'].quantile(0.6)

color
green    2.170
red      2.744
white    5.560
Name: price1, dtype: float64

To use your own defined aggregation function, pass it to **agg()**

In [96]:
# Note: this name shadows Python's built-in range() for the rest of the session
def range(series):
  return series.max() - series.min()

group['price1'].agg(range)

color
green    1.45
red      3.64
white    0.00
Name: price1, dtype: float64

In [99]:
group[['price1','price2']].agg(range)  # select the numeric columns; the text column 'object' cannot be subtracted

color,price1,price2
green,1.45,1.55
red,3.64,3.37
white,0.0,0.0


You can pass several aggregate functions in a list to **agg()**

In [101]:
group[['price1','price2']].agg(['mean','std',range])

,price1,price1,price1,price2,price2,price2
color,mean,std,range,mean,std,range
green,2.025,1.025305,1.45,2.375,1.096016,1.55
red,2.38,2.573869,3.64,2.435,2.38295,3.37
white,5.56,,0.0,4.75,,0.0
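
Passing a list to **agg()** applies every function to every column and yields two-level column labels. If you need different functions for different columns and flat output names, recent pandas versions support named aggregation, sketched below. The output names (`mean_price1`, `max_price2`, `range_price1`) are arbitrary labels chosen for this example.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                      'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                      'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
                      'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})

# Each keyword is an output column: (source column, aggregation).
summary = frame.groupby('color').agg(
    mean_price1=('price1', 'mean'),
    max_price2=('price2', 'max'),
    range_price1=('price1', lambda s: s.max() - s.min()),
)

print(summary)
```

Only the columns named in the tuples are touched, so the text column *object* does not need to be excluded by hand.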


##Advanced Data Aggregation
Suppose you want to combine the original dataframe and the one obtained by the group aggregation, for example, the sum.

In [102]:
frame = pd.DataFrame({ 'color':['white','red','green','red','green'],
                      'price1':[5.56,4.20,1.30,0.56,2.75],
                      'price2':[4.75,4.12,1.60,0.75,3.15]})
frame

Unnamed: 0,color,price1,price2
0,white,5.56,4.75
1,red,4.2,4.12
2,green,1.3,1.6
3,red,0.56,0.75
4,green,2.75,3.15


In [103]:
sums = frame.groupby('color').sum().add_prefix('tot_')
sums

color,tot_price1,tot_price2
green,4.05,4.75
red,4.76,4.87
white,5.56,4.75


You can add the sum result using the **merge()**

In [104]:
pd.merge(frame, sums, left_on= 'color', right_index=True)

Unnamed: 0,color,price1,price2,tot_price1,tot_price2
0,white,5.56,4.75,5.56,4.75
1,red,4.2,4.12,4.76,4.87
3,red,0.56,0.75,4.76,4.87
2,green,1.3,1.6,4.05,4.75
4,green,2.75,3.15,4.05,4.75


Another way to achieve this is to use **transform()**

In [105]:
frame.groupby('color').transform('sum').add_prefix('tot_')

Unnamed: 0,tot_price1,tot_price2
0,5.56,4.75
1,4.76,4.87
2,4.05,4.75
3,4.76,4.87
4,4.05,4.75


**transform()** performs the aggregation and returns the value computed for each group on every row of the original dataframe that belongs to that group.

The **apply()** function applies in its entirety the split-apply-combine scheme.

In [107]:
frame = pd.DataFrame( { 'color':['white','black','white','white','black','black'],
                       'status':['up','up','down','down','down','up'],
                       'value1':[12.33,14.55,22.34,27.84,23.40,18.33],
                       'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})
frame

Unnamed: 0,color,status,value1,value2
0,white,up,12.33,11.23
1,black,up,14.55,31.8
2,white,down,22.34,29.99
3,white,down,27.84,31.18
4,black,down,23.4,18.25
5,black,up,18.33,22.44


In [108]:
frame.groupby(['color','status']).apply(lambda x: x.max()) #pass a lambda function to apply()

color,status,color,status,value1,value2
black,down,black,down,23.4,18.25
black,up,black,up,18.33,31.8
white,down,white,down,27.84,31.18
white,up,white,up,12.33,11.23


In [112]:
reindex = {
    0: 'first',
    1: 'second',
    2: 'third',
    3: 'fourth',
    4: 'fifth',
    5: 'sixth'
}
recolumn = {
    'status': 'object',
    'price': 'value'  # frame has no 'price' column, so rename() silently ignores this entry
}

frame.rename(index=reindex, columns=recolumn)

Unnamed: 0,color,object,value1,value2
first,white,up,12.33,11.23
second,black,up,14.55,31.8
third,white,down,22.34,29.99
fourth,white,down,27.84,31.18
fifth,black,down,23.4,18.25
sixth,black,up,18.33,22.44


In [118]:
temp = pd.date_range('10/19/2018', periods=10, freq= 'H')
temp #generate time log

DatetimeIndex(['2018-10-19 00:00:00', '2018-10-19 01:00:00',
               '2018-10-19 02:00:00', '2018-10-19 03:00:00',
               '2018-10-19 04:00:00', '2018-10-19 05:00:00',
               '2018-10-19 06:00:00', '2018-10-19 07:00:00',
               '2018-10-19 08:00:00', '2018-10-19 09:00:00'],
              dtype='datetime64[ns]', freq='H')

In [119]:
timeseries = pd.Series(np.random.rand(10), index= temp)
timeseries # use time log as indexes

2018-10-19 00:00:00    0.311210
2018-10-19 01:00:00    0.422271
2018-10-19 02:00:00    0.446513
2018-10-19 03:00:00    0.604656
2018-10-19 04:00:00    0.835793
2018-10-19 05:00:00    0.062849
2018-10-19 06:00:00    0.437614
2018-10-19 07:00:00    0.966204
2018-10-19 08:00:00    0.714539
2018-10-19 09:00:00    0.898949
Freq: H, dtype: float64

In [120]:
timetable = pd.DataFrame( {'date': temp,
                          'value1': np.random.rand(10),
                          'value2': np.random.rand(10)})
timetable

Unnamed: 0,date,value1,value2
0,2018-10-19 00:00:00,0.774752,0.673493
1,2018-10-19 01:00:00,0.080827,0.3065
2,2018-10-19 02:00:00,0.039249,0.794418
3,2018-10-19 03:00:00,0.248962,0.987976
4,2018-10-19 04:00:00,0.180555,0.96338
5,2018-10-19 05:00:00,0.658327,0.418741
6,2018-10-19 06:00:00,0.692688,0.382739
7,2018-10-19 07:00:00,0.123786,0.792842
8,2018-10-19 08:00:00,0.181257,5.5e-05
9,2018-10-19 09:00:00,0.405021,0.183632


You can add a column of text values to the dataframe

In [121]:
timetable['cat'] = ['up','down','left','left','up','up','down','right',
                    'right','up']
timetable

Unnamed: 0,date,value1,value2,cat
0,2018-10-19 00:00:00,0.774752,0.673493,up
1,2018-10-19 01:00:00,0.080827,0.3065,down
2,2018-10-19 02:00:00,0.039249,0.794418,left
3,2018-10-19 03:00:00,0.248962,0.987976,left
4,2018-10-19 04:00:00,0.180555,0.96338,up
5,2018-10-19 05:00:00,0.658327,0.418741,up
6,2018-10-19 06:00:00,0.692688,0.382739,down
7,2018-10-19 07:00:00,0.123786,0.792842,right
8,2018-10-19 08:00:00,0.181257,5.5e-05,right
9,2018-10-19 09:00:00,0.405021,0.183632,up
