## Replacing Values

1. replace() returns a new Series by default; pass the inplace=True argument to modify the existing Series instead. <br>
2. You can replace multiple values at once. <br>
    a. Pass a list of values to replace, followed by a single substitute value or a list of substitutes of the same length.<br>
    b. Pass a dictionary that maps each old value to its new value.<br>
3. The data.replace() method is distinct from the data.str.replace() method, which performs string substitution element-wise.


In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.Series([1,-999,2,-999,-1000,3])
data

0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: int64

In [4]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [7]:
# Replace multiple values at once with list

data.replace([-999,-1000], [10,20])

0     1
1    10
2     2
3    10
4    20
5     3
dtype: int64

In [9]:
# Replace multiple values at once with dictionary 

data.replace({-999:30, -1000:40})

0     1
1    30
2     2
3    30
4    40
5     3
dtype: int64
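
The difference between replace() and str.replace() noted in point 3 can be sketched on a small Series. The Series `s` and the substitute character are made up for illustration and are not part of the notes above.

```python
import pandas as pd

# Hypothetical data, chosen only to contrast the two methods
s = pd.Series(["a-b", "-", "c-d"])

# Series.replace() matches whole values: only the element equal to "-" changes
whole = s.replace("-", "X")

# Series.str.replace() substitutes inside every string, element by element
within = s.str.replace("-", "X", regex=False)

print(whole.tolist())   # ['a-b', 'X', 'c-d']
print(within.tolist())  # ['aXb', 'X', 'cXd']

# Without inplace=True, replace() leaves the original Series unchanged
print(s.tolist())       # ['a-b', '-', 'c-d']
```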

## Renaming Axis Indexes

1. Axis labels can be transformed to produce new, differently labeled objects.<br>
2. Like a Series, an axis index has a map() method.<br>
3. The map() method has no inplace=True argument.<br>
4. map() only returns the transformed index values. To change the index of the dataset itself, assign the mapped values back to the existing index; this gives a transformed dataset by modifying the original one. <br>
5. rename(): to get a transformed version of a dataset without changing the original dataset, use the rename() method.<br>
6. rename() also accepts the inplace=True argument to update the original dataset. 


In [10]:
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index=['Ohio','Colorado','New York'],
                   columns = ['one','two','three','four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11


In [14]:
# map() method

transform = lambda x: x.upper()
data.index = data.index.map(transform)
data

          one  two  three  four
OHIO        0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11


In [16]:
# rename() method
data.rename(index=str.title, columns=str.upper)

          ONE  TWO  THREE  FOUR
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11


In [19]:
# rename() also accepts dict-like objects that provide new values
# for a subset of the axis labels.

data.rename(index={'OHIO':'INDIANA'}, columns ={'three':'peekaboo'},inplace=True)


In [20]:
data

          one  two  peekaboo  four
INDIANA     0    1         2     3
COLORADO    4    5         6     7
NEW YORK    8    9        10    11
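
The contrast between map() and rename() can be checked on a throwaway frame. This is a minimal sketch; the name `frame` and its labels are hypothetical and chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration
frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['ohio', 'utah'],
                     columns=['a', 'b', 'c'])

# Index.map() returns a new Index; frame keeps its labels until we assign it back
upper_index = frame.index.map(str.upper)
print(list(frame.index))      # ['ohio', 'utah']

# rename() returns a relabeled copy and leaves frame untouched
renamed = frame.rename(index=str.title, columns={'a': 'alpha'})
print(list(renamed.index))    # ['Ohio', 'Utah']
print(list(renamed.columns))  # ['alpha', 'b', 'c']
print(list(frame.columns))    # ['a', 'b', 'c']

# Assigning the mapped index modifies frame itself
frame.index = upper_index
print(list(frame.index))      # ['OHIO', 'UTAH']
```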


## Discretization and Binning
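1. cats = pd.cut(data, bins): returns a Categorical object that shows the bin each value belongs to.<br>
    a. Intervals are closed on the right by default; pass right=False to make the right edge of each bin exclusive.<br>
2. cats.codes returns the index of the bin each value falls into.<br>
3. cats.categories returns the bin intervals themselves.<br>
4. pd.value_counts(cats) returns the number of values in each bin.<br>
5. The bins can also be given names with the labels argument.<br>
6. Instead of explicit bin edges, you can pass the number of bins. pandas will compute equal-length bins based on the minimum and maximum values in the data. <br>
    a. precision=2 limits the decimal precision to two digits.<br>
    b. With cut, the bins have equal width, but the number of values that fall into each bin differs.<br>
7. qcut() bins the data based on sample quantiles: the bin widths differ, but each bin holds roughly the same number of values.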

In [21]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [22]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [23]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
              closed='right',
              dtype='interval[int64]')

In [24]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

In [26]:
# name the bins
group_names=['Youth','YoungAdult','MiddleAged','Senior']

cats = pd.cut(ages, bins, labels=group_names)

In [27]:
pd.value_counts(cats)

Youth         5
MiddleAged    3
YoungAdult    3
Senior        1
dtype: int64
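
The effect of right=False is easiest to see on a value that sits exactly on a bin edge. A small sketch, reusing the ages and bins from the cells above (the variable names `right_closed` and `left_closed` are only for illustration):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

right_closed = pd.cut(ages, bins)              # default: (18, 25], (25, 35], ...
left_closed = pd.cut(ages, bins, right=False)  # [18, 25), [25, 35), ...

# The age 25 (position 2) lies on a bin edge, so it moves to the next bin
print(right_closed.codes[2])      # 0
print(left_closed.codes[2])       # 1
print(left_closed.categories[0])  # [18, 25)
```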

In [28]:
# input is the number of bins
data = np.random.rand(20)
cat = pd.cut(data, 4, precision=2)
cat

[(0.0083, 0.26], (0.5, 0.75], (0.0083, 0.26], (0.75, 0.99], (0.75, 0.99], ..., (0.0083, 0.26], (0.0083, 0.26], (0.0083, 0.26], (0.5, 0.75], (0.5, 0.75]]
Length: 20
Categories (4, interval[float64]): [(0.0083, 0.26] < (0.26, 0.5] < (0.5, 0.75] < (0.75, 0.99]]

In [29]:
data=np.random.rand(1000)
cats = pd.qcut(data,4)
cats

[(0.773, 1.0], (0.518, 0.773], (0.518, 0.773], (0.00051, 0.271], (0.773, 1.0], ..., (0.773, 1.0], (0.00051, 0.271], (0.518, 0.773], (0.271, 0.518], (0.00051, 0.271]]
Length: 1000
Categories (4, interval[float64]): [(0.00051, 0.271] < (0.271, 0.518] < (0.518, 0.773] < (0.773, 1.0]]

In [30]:
pd.value_counts(cats)

(0.773, 1.0]        250
(0.518, 0.773]      250
(0.271, 0.518]      250
(0.00051, 0.271]    250
dtype: int64

In [31]:
# customize percentiles
cat1 = pd.qcut(data,[0,0.1,0.5,0.9,1])
cat1

[(0.518, 0.906], (0.518, 0.906], (0.518, 0.906], (0.104, 0.518], (0.518, 0.906], ..., (0.518, 0.906], (0.00051, 0.104], (0.518, 0.906], (0.104, 0.518], (0.00051, 0.104]]
Length: 1000
Categories (4, interval[float64]): [(0.00051, 0.104] < (0.104, 0.518] < (0.518, 0.906] < (0.906, 1.0]]

In [32]:
pd.value_counts(cat1)

(0.518, 0.906]      400
(0.104, 0.518]      400
(0.906, 1.0]        100
(0.00051, 0.104]    100
dtype: int64
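
To compare cut and qcut side by side, the same skewed data can be binned both ways. This is only a sketch: the exponential sample, the seed, and the variable names are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)       # fixed seed, chosen arbitrarily
values = rng.exponential(size=1000)  # skewed data, so the contrast is visible

equal_width = pd.cut(values, 4)   # four bins of (nearly) equal width
equal_count = pd.qcut(values, 4)  # four bins cut at the sample quartiles

# cut: equal widths, very uneven counts on skewed data
print(pd.Series(equal_width).value_counts(sort=False))

# qcut: uneven widths, the same number of values in each bin
print(pd.Series(equal_count).value_counts(sort=False))
```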

## Detecting and Filtering Outliers

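1. any() method --- to select all rows containing a value that exceeds a threshold in absolute value (for example, above 3 or below -3), call any() along axis 1 on a boolean DataFrame. <br>
2. np.sign(data) returns 1 or -1 depending on whether each value in data is positive or negative, regardless of magnitude: any positive number gives 1 and any negative number gives -1 (zero gives 0).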

In [33]:
data = pd.DataFrame(np.random.rand(1000,4))
data.describe()

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.500628     0.478834     0.506470     0.488543
std       0.284335     0.290392     0.288259     0.290407
min       0.000682     0.000337     0.001393     0.000992
25%       0.253402     0.209402     0.249379     0.227391
50%       0.501836     0.475919     0.495922     0.492779
75%       0.740481     0.732913     0.756842     0.733018
max       0.998587     0.999590     0.998096     0.999319


In [41]:
data[np.abs(data[2])>0.999][2]

Series([], Name: 2, dtype: float64)

In [42]:
# select all rows having a value whose absolute value exceeds 0.999,
# using the any() method on a boolean DataFrame

data[(np.abs(data)>0.999).any(axis=1)]

            0         1         2         3
201  0.596549  0.379461  0.291522  0.999319
362  0.964083  0.157927  0.238261  0.999139
552  0.823604  0.999590  0.356829  0.743404


In [40]:
np.sign(-100)

-1
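
One common use of np.sign() together with the boolean mask is to cap outliers at a threshold. The sketch below assumes standard normal data and the threshold of 3 mentioned in the notes; the seed and the names `frame`, `outlier_mask` and `capped` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily
frame = pd.DataFrame(rng.standard_normal((1000, 4)))

threshold = 3
outlier_mask = frame.abs() > threshold

# Rows with at least one value beyond +/- threshold
rows_with_outliers = frame[outlier_mask.any(axis=1)]

# np.sign(frame) is -1, 0 or 1, so np.sign(frame) * threshold is -3, 0 or 3.
# mask() swaps in that capped value wherever outlier_mask is True.
capped = frame.mask(outlier_mask, np.sign(frame) * threshold)

print(len(rows_with_outliers))
print(capped.abs().max().max())
```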

## Permutation (randomly reordering) and Random Sampling
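1. np.random.permutation() -- given an integer n, returns the integers 0 to n-1 in random order; given an array, returns a shuffled copy. (I did not fully understand this one.) <br>
2. pandas DataFrame.take() -- selects data by position, not by index label. take([0,1,2],axis=0) returns the 1st, 2nd and 3rd rows, not the rows whose index labels are 0, 1, 2.<br>
3. pandas DataFrame.sample() -- draws a random sample of rows or columns from the calling DataFrame. <br>
    (1) n: int value, number of random rows to draw<br>
    (2) replace: boolean value; if True, sample with replacement, so the same row can be chosen more than once. My understanding: with replace=True a value that was already drawn can be drawn again, while with replace=False no value can repeat. So if the sample size is larger than the size of the original data, replace=True is required.<br>
    (3) axis: 0 for rows and 1 for columns<br>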

In [54]:
df = pd.DataFrame(np.arange(5*4).reshape((5,4)))
sampler = np.random.permutation(5)
print(df)

    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19


In [50]:
colors=pd.DataFrame({'color1':['red','blue'],'color2':['yellow','black']})
colors

   color1  color2
0     red  yellow
1    blue   black


In [52]:
test = np.random.permutation(colors['color1'])
test

array(['blue', 'red'], dtype=object)

In [55]:
df.sample(n=3)

    0   1   2   3
3  12  13  14  15
0   0   1   2   3
4  16  17  18  19


In [63]:
test = df.sample(n=20)
test

ValueError: Cannot take a larger sample than population when 'replace=False'
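
The pieces from the notes above (permutation with take(), and sample() with replace and axis) can be put together as follows. This is a sketch under assumptions: `frame` mirrors the 5x4 DataFrame used earlier, and random_state=0 is an arbitrary choice to make the draw repeatable.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# permutation(5) is a random ordering of 0..4; take() picks rows by position,
# so the result is the same frame with its rows shuffled
sampler = np.random.permutation(5)
shuffled = frame.take(sampler)

# Asking for more rows than exist only works with replace=True
oversampled = frame.sample(n=20, replace=True, random_state=0)

# axis=1 samples columns instead of rows
two_columns = frame.sample(n=2, axis=1, random_state=0)

print(shuffled)
print(len(oversampled))
print(two_columns.shape)
```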

## Computing Indicator/Dummy Variables -- too confusing, not fully understood

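1. get_dummies() converts a categorical column into a matrix of indicator (dummy) columns, one per distinct value.
2. Three ways to join the dummy matrix back to the original dataframe
    a. concat()
    b. join()
    c. merge()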

In [64]:
df = pd.DataFrame({'key':['b','b','a','c','a','b'], 'data1':range(6)})
df

   data1  key
0      0    b
1      1    b
2      2    a
3      3    c
4      4    a
5      5    b


In [65]:
tst = pd.get_dummies(df['key'],prefix='key')
tst

   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0


In [67]:
test = pd.concat([df,tst], axis=1)
test

   data1  key  key_a  key_b  key_c
0      0    b      0      1      0
1      1    b      0      1      0
2      2    a      1      0      0
3      3    c      0      0      1
4      4    a      1      0      0
5      5    b      0      1      0


In [68]:
df_with_dummies = df.join(tst)
df_with_dummies

   data1  key  key_a  key_b  key_c
0      0    b      0      1      0
1      1    b      0      1      0
2      2    a      1      0      0
3      3    c      0      0      1
4      4    a      1      0      0
5      5    b      0      1      0
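
The cells above show concat() and join(); the third option, merge(), is not demonstrated. A minimal sketch of how it could be done by aligning on the index (the names `frame`, `dummies` and `merged` are chosen for this example only):

```python
import pandas as pd

frame = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
dummies = pd.get_dummies(frame['key'], prefix='key')

# merge() needs to be told to align on the index of both sides
merged = frame.merge(dummies, left_index=True, right_index=True)

# join() and concat(axis=1) align on the index by default
joined = frame.join(dummies)
concatenated = pd.concat([frame, dummies], axis=1)

print(list(merged.columns))  # ['key', 'data1', 'key_a', 'key_b', 'key_c']
```

All three produce the same table here because both frames share the same index; join() is the shortest to write for this case.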


In [71]:
mnames=['movie_id','title','genre']
moviedata = pd.read_table('movies.dat',sep='::',header=None, names = mnames, engine='python')
moviedata.head(5)

   movie_id                               title                         genre
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy


In [74]:
# A list cannot be added to a set with add().
# set.add() adds a single hashable element; a list is unhashable, so add() raises TypeError.
# set.update() adds every element of an iterable to the current set.

all_genres = set()

for x in moviedata['genre']:
    all_genres.update(x.split('|'))

all_genres

{'Action',
 'Adventure',
 'Animation',
 "Children's",
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western'}

In [78]:
zero_matrix = np.zeros((len(moviedata),len(all_genres)))
zero_matrix[:1]

array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.]])

In [79]:
dummies = pd.DataFrame(zero_matrix, columns=list(all_genres))
dummies.head(5)

   Musical  Western  Adventure  Drama  Romance  Crime  Fantasy  Animation  Action  Horror  Thriller  Film-Noir  Documentary  Mystery  Comedy  War  Sci-Fi  Children's
0      0.0      0.0        0.0    0.0      0.0    0.0      0.0        0.0     0.0     0.0       0.0        0.0          0.0      0.0     0.0  0.0     0.0         0.0
1      0.0      0.0        0.0    0.0      0.0    0.0      0.0        0.0     0.0     0.0       0.0        0.0          0.0      0.0     0.0  0.0     0.0         0.0
2      0.0      0.0        0.0    0.0      0.0    0.0      0.0        0.0     0.0     0.0       0.0        0.0          0.0      0.0     0.0  0.0     0.0         0.0
3      0.0      0.0        0.0    0.0      0.0    0.0      0.0        0.0     0.0     0.0       0.0        0.0          0.0      0.0     0.0  0.0     0.0         0.0
4      0.0      0.0        0.0    0.0      0.0    0.0      0.0        0.0     0.0     0.0       0.0        0.0          0.0      0.0     0.0  0.0     0.0         0.0


In [85]:
gen = moviedata['genre'][0]
gen.split('|')
dummies.columns.get_indexer(gen.split('|'))

array([ 7, 17, 14], dtype=int64)
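
The last cell finds the column positions for one movie's genres. A natural next step would be to set those positions to 1 for every row. The sketch below assumes that approach and uses three rows copied from the head() output above in place of movies.dat; the names `movies`, `genres` and `result`, and the `Genre_` prefix, are hypothetical.

```python
import numpy as np
import pandas as pd

# Three rows taken from the head() output, standing in for movies.dat
movies = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genre': ["Animation|Children's|Comedy",
              "Adventure|Children's|Fantasy",
              'Comedy|Romance'],
})

# Sorted list of distinct genres, so the column order is reproducible
genres = sorted(set('|'.join(movies['genre']).split('|')))
dummies = pd.DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

# For each movie, look up the column positions of its genres and set them to 1
for row, genre_string in enumerate(movies['genre']):
    positions = dummies.columns.get_indexer(genre_string.split('|'))
    dummies.iloc[row, positions] = 1

# Attach the indicator columns to the original data
result = movies.join(dummies.add_prefix('Genre_'))
print(dummies.sum(axis=1).tolist())  # [3.0, 3.0, 2.0]

# pandas also offers a shortcut for delimiter-separated strings
shortcut = movies['genre'].str.get_dummies('|')
```

The explicit loop makes each step visible, while str.get_dummies('|') builds the same kind of indicator table in one call.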