<h1><font color = #fc7cc9> Ch. 7 Data Cleaning and preparation
    <br>pg. 191 - 251
    

In [1]:
import pandas as pd
import numpy as np

### In general:
<b> https://realpython.com/python-map-function/</b><br>
A helpful link for this entire chapter, that explains what Mapping, Filtering and Reducing are. 

<h2> <font color = #39abed> 7.1 Handling Missing Data
    </h2> <br> 
    pandas uses the floating-point value NaN (Not a Number) as its standard marker for missing data.
 

In [2]:
string_data = pd.Series(['aardvark', 'artichoke',  np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [3]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

    The pandas NA (not available) convention is adopted from R.
    The built-in Python None value is also treated as NA in object arrays.

In [4]:
string_data[0] = None  # Replace the FIRST value in string_data with None, which pandas treats as NA.
string_data.isnull() # Returns a boolean mask marking which values are NA (append .sum() to count them).

0     True
1    False
2     True
3    False
dtype: bool

See Table 7-1 on pg 192 for other NA handling methods.
E.g. - fillna, fill the missing data with some value. 

<h3> <font color = #39abed>Filtering out missing data
    </h3> <br>  Can do this by using pandas.isnull and boolean indexing, OR by using 'dropna'

In [5]:
# FOR A Series: 

from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])  #Create the series
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [6]:
# Now drop all NA values from the series
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [7]:
# You can do the above .dropna() OR you can do:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

    FOR DATAFRAMES: You may want to drop rows or columns. By default, .dropna drops any row containing a missing value.

In [8]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], 
                     [NA, NA, NA], [NA, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [9]:
cleaned = data.dropna() # deleting all ROWS with NA values
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


    However, passing how='all' drops only the rows whose values are ALL NA.

In [10]:
data.dropna(how='all') # So only row 2 was dropped since it had ALL NaN values.

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [11]:
# To drop columns the same way as above, use axis = 1
data[4] = NA  # This says, add a column '4', and make all values NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [12]:
# Next, drop columns where ALL values are NaN
data.dropna(axis = 1, how = 'all')

# MUST put in the axis =1, otherwise only rows with all NAs will be dropped, not columns. 

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


    A related way to filter DataFrame rows comes up often with time series data. Suppose you want to keep only the rows containing at least a certain number of observations. You can specify this with the *thresh* argument.

In [13]:
df = pd.DataFrame(np.random.rand(7, 3)) # Create the dataframe.
df.iloc[:4, 1] = NA  # iloc selects by integer position: set the first 4 rows (0-3) of column '1' to NaN
df.iloc[:2, 2] = NA # set the first 2 rows (0-1) of column '2' to NaN
df

Unnamed: 0,0,1,2
0,0.156056,,
1,0.299272,,
2,0.801893,,0.789162
3,0.152992,,0.705972
4,0.940007,0.125207,0.971519
5,0.080234,0.123432,0.31998
6,0.522403,0.595294,0.793508


In [14]:
df.dropna() # Drop every row that contains at least one NA value.

Unnamed: 0,0,1,2
4,0.940007,0.125207,0.971519
5,0.080234,0.123432,0.31998
6,0.522403,0.595294,0.793508


In [15]:
# Instead of the above, use thresh if you want to keep rows that are only partly missing.
df.dropna(thresh = 2) 

# thresh = 2 keeps only the rows that have at least 2 non-NA values.
# Rows 0 and 1 have just one non-NA value each (column '0'), so they are the ones dropped.

Unnamed: 0,0,1,2
2,0.801893,,0.789162
3,0.152992,,0.705972
4,0.940007,0.125207,0.971519
5,0.080234,0.123432,0.31998
6,0.522403,0.595294,0.793508
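
A minimal sketch of how thresh decides which rows survive, using a small hand-made frame instead of random numbers so the counts are easy to check by eye. The names demo, non_na_counts and kept are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Small hand-made frame so the non-NA count of each row is easy to read off.
demo = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0],
    'b': [np.nan, 2.0, np.nan, 4.0],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

# Number of non-NA values in each row.
non_na_counts = demo.notna().sum(axis=1)

# thresh=2 keeps only the rows with at least 2 non-NA values.
kept = demo.dropna(thresh=2)

print(non_na_counts.tolist())
print(kept.index.tolist())
```

Comparing the two printed lists shows that the rows kept are exactly those whose non-NA count is 2 or more.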


<h3> <font color = #39abed>Filling <i>in</i> missing data </h3> <br>
    You may want to fill in the 'holes' in the data.
    The most common method is fillna; calling it with a constant replaces the missing values with that constant.

In [16]:
df

Unnamed: 0,0,1,2
0,0.156056,,
1,0.299272,,
2,0.801893,,0.789162
3,0.152992,,0.705972
4,0.940007,0.125207,0.971519
5,0.080234,0.123432,0.31998
6,0.522403,0.595294,0.793508


In [17]:
df.fillna(0) # Fill in all the NAs in df with a 0

Unnamed: 0,0,1,2
0,0.156056,0.0,0.0
1,0.299272,0.0,0.0
2,0.801893,0.0,0.789162
3,0.152992,0.0,0.705972
4,0.940007,0.125207,0.971519
5,0.080234,0.123432,0.31998
6,0.522403,0.595294,0.793508


    You can also call fillna with a dict, if you want to have a different fill value for each column!

In [18]:
df # What it looks like before

Unnamed: 0,0,1,2
0,0.156056,,
1,0.299272,,
2,0.801893,,0.789162
3,0.152992,,0.705972
4,0.940007,0.125207,0.971519
5,0.080234,0.123432,0.31998
6,0.522403,0.595294,0.793508


In [19]:
# What it looks like after
df.fillna({1: 0.5, 2:0}) # For col '1', replace all NA with 0.5, and for col '2' replace all NAs with 0

Unnamed: 0,0,1,2
0,0.156056,0.5,0.0
1,0.299272,0.5,0.0
2,0.801893,0.5,0.789162
3,0.152992,0.5,0.705972
4,0.940007,0.125207,0.971519
5,0.080234,0.123432,0.31998
6,0.522403,0.595294,0.793508


### <font color = 'red'>[!] WTF does this mean?? pg. 195
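    fillna returns a _new object_ by default and leaves the original untouched; passing inplace = True modifies the existing object instead.

In [20]:
_ = df.fillna(0, inplace = True) # inplace = True changes df itself; with inplace = False df would keep its NaNs and the filled copy would be returned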
df

Unnamed: 0,0,1,2
0,0.156056,0.0,0.0
1,0.299272,0.0,0.0
2,0.801893,0.0,0.789162
3,0.152992,0.0,0.705972
4,0.940007,0.125207,0.971519
5,0.080234,0.123432,0.31998
6,0.522403,0.595294,0.793508
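
To see the difference between the two modes side by side, here is a small sketch on a throwaway frame. The names original and filled_copy are hypothetical and not part of the notes above.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({'x': [1.0, np.nan, 3.0]})

# Default behaviour: a new, filled object comes back and 'original' keeps its NaN.
filled_copy = original.fillna(0)
still_has_nan = bool(original['x'].isna().any())

# inplace=True: 'original' itself is changed.
original.fillna(0, inplace=True)
now_has_nan = bool(original['x'].isna().any())

print(still_has_nan, now_has_nan)
```

So the output of the cell looks the same either way; what differs is whether the variable you started with has been changed afterwards.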


In [21]:
# The same interpolation methods available for reindexing can be used with fillna:
df = pd.DataFrame(np.random.randn(6, 3)) # Create a new 6x3 DataFrame of random numbers
df

Unnamed: 0,0,1,2
0,0.737183,-0.082343,-1.785441
1,-0.006216,-0.134977,-0.675448
2,0.239683,-2.346805,-0.74057
3,1.136098,-1.277385,0.0696
4,0.931045,1.773199,-0.304626
5,1.122634,-0.719763,-0.34261


In [22]:
df.iloc[2:, 1] = NA   #Replace row 2 onwards, in col '1', with NA
df.iloc[4:, 2] = NA   #Replace rows 4 onwards in col '2' with NA
df

Unnamed: 0,0,1,2
0,0.737183,-0.082343,-1.785441
1,-0.006216,-0.134977,-0.675448
2,0.239683,,-0.74057
3,1.136098,,0.0696
4,0.931045,,
5,1.122634,,


In [23]:
# Forward-fill the NAs (recent pandas versions deprecate the method argument in favor of df.ffill())
df.fillna(method = 'ffill')

Unnamed: 0,0,1,2
0,0.737183,-0.082343,-1.785441
1,-0.006216,-0.134977,-0.675448
2,0.239683,-0.134977,-0.74057
3,1.136098,-0.134977,0.0696
4,0.931045,-0.134977,0.0696
5,1.122634,-0.134977,0.0696


   <blockquote><b>ffill() function is used to fill the missing value in the dataframe. 'ffill' stands for 'forward fill' and will propagate last valid observation forward</b></blockquote>
    In other words, each NA is replaced with the last valid value above it.

In [43]:
# Only apply the Forward Fill to 2 values, max
df.fillna(method = 'ffill', limit = 2)

Unnamed: 0,0,1,2
0,-0.639978,-0.799902,-0.070594
1,0.725652,0.243492,0.511909
2,-0.386548,0.243492,0.272159
3,-0.475659,0.243492,0.13965
4,0.110501,,0.13965
5,1.260373,,0.13965
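
A small sketch of how limit caps a forward fill, using a made-up Series (gaps is a hypothetical name) and the Series.ffill method rather than fillna(method = 'ffill'):

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# No limit: every NaN in the run takes the last valid value before it.
filled_all = gaps.ffill()

# limit=2: at most 2 consecutive NaNs are filled; the third one stays NaN.
filled_two = gaps.ffill(limit=2)

print(filled_all.tolist())
print(filled_two.tolist())
```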


In [44]:
# You can also pass a Series' mean or median value to fillna!
data = pd.Series([1., NA, 3.5, NA, 7])
data # what the series looks like before:

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [45]:
#What it looks like after, when we replace all NAs with the mean value
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

    See Table 7.2 on pg 197 for more fillna arguments
    The default is rows (axis = 0); for columns, use axis = 1

<h2> <font color = #39abed> 7.2 Data Transformation

<h3> <font color = #39abed>Removing Duplicates


In [24]:
# Example of a df with duplicates:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                    'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [25]:
# The DataFrame method 'duplicated' returns a boolean Series indicating whether each row is a duplicate, i.e. identical in every column to ANY earlier row (not just the one directly before it)
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [48]:
# drop_duplicates returns a df where the 'duplicated' array is False:
data.drop_duplicates()


#aka. this method will drop all cases where method 'duplicated' is True. 

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [28]:
data.drop_duplicates(['k1']) # keep only the first row for each unique value in k1

#BE CAREFUL when you apply these methods. The first occurrence is kept by default, so the result depends on the row order.

Unnamed: 0,k1,k2
0,one,1
1,two,1


    You can also indicate if you only want to drop duplicates from a specific column!

In [29]:
# First, start by adding another col to the df, called v1
data['v1'] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [30]:
# Next, drop any duplicates that are found in k1 only 
data.drop_duplicates(['k1'])


# Remember that dropping NAs or duplicates removes the ENTIRE row they are found in, so the result stays a rectangular dataframe.

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
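
Which row survives depends on the keep argument of drop_duplicates. The sketch below uses a cut-down, made-up version of the frame above (frame, first_kept and last_kept are hypothetical names) to compare the default keep='first' with keep='last'.

```python
import pandas as pd

frame = pd.DataFrame({'k1': ['one', 'two', 'one', 'two'],
                      'v1': [0, 1, 2, 3]})

# Default keep='first': the first row seen for each k1 value is kept.
first_kept = frame.drop_duplicates(subset=['k1'])

# keep='last': the last row seen for each k1 value is kept instead.
last_kept = frame.drop_duplicates(subset=['k1'], keep='last')

print(first_kept['v1'].tolist())
print(last_kept['v1'].tolist())
```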


<h3> <font color = #39abed>Transforming data using a function or mapping </h3>
    You may want to transform values in an array, Series, or column in DF.


In [31]:
# First, create some data to work with

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                             'Pastrami', 'corned beef', 'Bacon',
                             'pastrami', 'honey ham', 'nova lox'],
                    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


In [32]:
# Suppose you want to add a column that gives the type of animal each food came from.
# First, we can write down the _mapping_ (lookup table) from each meat to its animal

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

### <font color = 8f4cc2> Mapping!
<blockquote> Python's map() is a built-in function that allows you to process and transform all the items in an iterable without using an explicit for loop, a technique commonly known as mapping. map() is useful when you need to apply a transformation function to each item in an iterable and transform them into a new iterable</blockquote>
<b> https://realpython.com/python-map-function/ </b>

    In other words, map looks up (or computes) a new value for every element of a Series, one element at a time, and returns the results as a new Series. See example below.
    

In [33]:
# The 'map' method on a Series accepts a function or dict-like object containing a mapping...
# BUT before that, the capitalisation needs to match the dict keys EXACTLY.
# So, we need to transform the data so capitalisations match.

lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [34]:
# Now, map the lowercased food names through the dict...

data['animal'] = lowercased.map(meat_to_animal) # Create the column 'animal' by looking up each lowercased food name in meat_to_animal
data

# This adds another column to the dataframe
# See page 199 for a better, color-coded version of the table. The animal col is a regular column whose values were produced by _map_

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


### <font color = 'red'>[!] WTF does this mean?? pg. 300
    
https://realpython.com/python-lambda/

In [35]:
# Instead of the above, a single function that does all the work could have been passed

data['food'].map(lambda x: meat_to_animal[x.lower()])

# The lambda takes each value x of the 'food' col, lowercases it, and looks it up in meat_to_animal (pg 200).
# x.lower() is needed because the dict keys are all lowercase while the col has 'Pastrami' and 'Bacon'; without it the lookup raises a KeyError.

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
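
One practical difference between the dict form and the function form is how they treat a value that has no entry in the mapping. The sketch below uses a made-up Series that includes 'tofu', which is deliberately absent from the hypothetical lookup dict.

```python
import pandas as pd

food = pd.Series(['bacon', 'Pastrami', 'tofu'])
lookup = {'bacon': 'pig', 'pastrami': 'cow'}

# Dict-based map: a value with no matching key becomes NaN instead of raising.
via_dict = food.str.lower().map(lookup)

# Function-based map using dict.get, which also tolerates unknown foods;
# indexing with lookup[x.lower()] would raise a KeyError on 'tofu'.
via_func = food.map(lambda x: lookup.get(x.lower()))

print(via_dict.isna().tolist())
print(via_func.isna().tolist())
```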

<h3> <font color = #39abed>Replacing Values


In [36]:
# First, create example data/Series
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data


#In this example -999 could stand for missing data.

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [37]:
# Now, we want to replace -999 values with NA/NaN
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [38]:
# IF you want to replace MULTIPLE values at the same time, pass a list!
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [39]:
# To replace MULTIPLE values, each with ITS OWN replacement, pass a list of substitutes:
data.replace([-999, -1000], [np.nan, 0])  # Where np.nan is for -999, and 0 for -1000

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [40]:
# The above can ALSO be passed as a dict
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

<h3> <font color = #39abed>Renaming Axis Indexes </h3>
     
     Axis labels can also be transformed with a function or mapping, producing a new, differently labeled object. You can also modify the axes in place, without creating a new data structure.

In [41]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                   index = ['Ohio', 'Colorado', 'New York'],
                   columns = ['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [42]:
# Axis indexes also have a map method:
transform = lambda x: x[:4].upper()  

# First, we are creating the function, which states:
# take the input label (x), keep its first 4 characters, and make them upper case

In [43]:
#Now, apply the lambda function 'transform'
data.index = data.index.map(transform) 

In [44]:
# Now look at the data
data

# Here, the 'transform' function was applied to every index label, replacing each with its first 4 characters in upper case.

Unnamed: 0,one,two,three,four
OHIO,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


In [45]:
# Using the method 'rename' returns a transformed copy and does NOT modify the original data
data.rename(index = str.title, columns = str.upper)

# The above title-cases the index labels and upper-cases the column names

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [46]:
# Rename can also be used with a dict-like object to give new labels to a subset of the axis labels
data.rename(index = {'OHIO': 'INDIANA'}, 
           columns = {'three': 'peekaboo'})

# The above renames the index label OHIO to INDIANA, and col three to peekaboo

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


    Rename saves you from having to copy the DataFrame manually and assign values to its index and cols.
    If you want rename to modify the original data, you can pass 'inplace = True'.

In [47]:
data.rename(index = {'OHIO': 'INDIANA'}, inplace = True)
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


<h3> <font color = #39abed>Discretization and Binning</h3><br>
    Continuous data is usually 'discretized' aka, separated into bins. See example below with groups of people in a study and you want to group them into discrete age buckets: <br>
    <p> (Common for continuous variables, like age)

In [48]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]


# Now divide the ages data into bins, using pd.cut

cats = pd.cut(ages, bins) # Assigns each age to the bin (interval) it falls into.
cats


# What is returned is a special pandas 'Categorical' object, which can be treated like an array of strings naming the bins.

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

See official doc on pd.cut aka, using bins in pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

#### An example project with pd.cut / bins:
https://realpython.com/fast-flexible-pandas/

In [49]:
cats.codes  # which bin each data from 'ages' belongs to...the first, second, etc. 

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [50]:
cats.categories # all the different bins that are observed for the cats/ages data

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [51]:
pd.value_counts(cats)  # the bin counts for the results of pd.cuts, like a bin tally

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

<blockquote>
    Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:
    <b>pg. 203

In [52]:
pd.cut(ages, [18, 26, 36, 61, 100], right = False) # with right = False each bin is closed on the left and open on the right, e.g. [18, 26) includes 18 but excludes 26

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
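
The effect of right is easiest to see on a value that sits exactly on a bin edge. In this sketch (edges, right_closed and left_closed are made-up names) the age 25 is binned twice with the same edges:

```python
import pandas as pd

edges = [18, 25, 35]

# Default right=True: bins are (18, 25] and (25, 35], so 25 lands in the first bin.
right_closed = pd.cut([25], edges)

# right=False: bins are [18, 25) and [25, 35), so 25 lands in the second bin.
left_closed = pd.cut([25], edges, right=False)

print(right_closed.codes.tolist())
print(left_closed.codes.tolist())
```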

#### You can also pass your OWN bin names, by passing a list or array to the 'labels' option!

In [53]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels = group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

<blockquote> If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the case of some uniformly distributed data chopped into fourths: (see below)
<b> pg. 204

In [54]:
data = np.random.rand(20)
data

array([0.93145275, 0.95579816, 0.94662039, 0.54666561, 0.37402652,
       0.78703519, 0.16061905, 0.90558626, 0.97674698, 0.19859324,
       0.84854729, 0.21723832, 0.89354542, 0.57000467, 0.40963353,
       0.7795278 , 0.67619957, 0.73989992, 0.12298183, 0.96467085])

In [56]:
pd.cut(data, 4, precision = 2) # precision limits the decimal place to 2

[(0.76, 0.98], (0.76, 0.98], (0.76, 0.98], (0.34, 0.55], (0.34, 0.55], ..., (0.76, 0.98], (0.55, 0.76], (0.55, 0.76], (0.12, 0.34], (0.76, 0.98]]
Length: 20
Categories (4, interval[float64]): [(0.12, 0.34] < (0.34, 0.55] < (0.55, 0.76] < (0.76, 0.98]]

### <font color = 'red'>[ ? ] Does this mean that you should NOT use an integer number of bins?

### Using qcut
There is also the function 'qcut', which bins the data based on sample quantiles. You may want to use this instead of 'cut' because 'cut' may not give you the same number of data points in each bin. 
<blockquote>
    Since qcut  uses sample quantiles instead, by definition you will obtain roughly equal-size bins: (see below)

In [57]:
data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into 4 quartiles, approx. evenly.

cats

[(0.694, 2.864], (0.694, 2.864], (0.694, 2.864], (-2.8899999999999997, -0.664], (-0.664, -0.0044], ..., (-2.8899999999999997, -0.664], (-0.664, -0.0044], (-0.0044, 0.694], (-0.0044, 0.694], (-2.8899999999999997, -0.664]]
Length: 1000
Categories (4, interval[float64]): [(-2.8899999999999997, -0.664] < (-0.664, -0.0044] < (-0.0044, 0.694] < (0.694, 2.864]]

In [58]:
pd.value_counts(cats) # Shows you how many data points are in each 'bin'

(0.694, 2.864]                   250
(-0.0044, 0.694]                 250
(-0.664, -0.0044]                250
(-2.8899999999999997, -0.664]    250
dtype: int64

In [59]:
# You can also pass your OWN quantiles with qcut, using no. between 0 and 1, inclusive.
dog = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
dog

[(1.273, 2.864], (-0.0044, 1.273], (-0.0044, 1.273], (-1.349, -0.0044], (-1.349, -0.0044], ..., (-1.349, -0.0044], (-1.349, -0.0044], (-0.0044, 1.273], (-0.0044, 1.273], (-1.349, -0.0044]]
Length: 1000
Categories (4, interval[float64]): [(-2.8899999999999997, -1.349] < (-1.349, -0.0044] < (-0.0044, 1.273] < (1.273, 2.864]]

In [60]:
pd.value_counts(dog) # this is how it looks...

(-0.0044, 1.273]                 400
(-1.349, -0.0044]                400
(1.273, 2.864]                   100
(-2.8899999999999997, -1.349]    100
dtype: int64

### <font color = 'red'> [ ? ] When and why would you ever want to set your own quantiles? I don't get the example either... (see above chunks).
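
One possible use for hand-picked quantiles is separating the tails from the middle of a distribution, for example flagging the bottom and top 10% of values. The sketch below is only an illustration under that assumption: the data (the integers 0 to 99) and the label names are invented.

```python
import numpy as np
import pandas as pd

scores = np.arange(100)  # hypothetical data: the integers 0..99

# Unequal quantile edges: bottom 10%, next 40%, next 40%, top 10%.
tails = pd.qcut(scores, [0, 0.1, 0.5, 0.9, 1.0],
                labels=['bottom 10%', 'lower middle', 'upper middle', 'top 10%'])

# Tally how many values fall into each named bin.
counts = pd.Series(tails).value_counts()
print(counts)
```

The bin sizes follow the quantile widths rather than being equal, which is the same pattern as the 100 / 400 / 400 / 100 tally above.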

<h3> <font color = #39abed> Detecting and Filtering Outliers </h3>
<p> Mainly just a matter of applying array operations. Below will be an example with normally distributed data. 


In [61]:
data = pd.DataFrame(np.random.randn(1000, 4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,-0.004553,0.036195,0.01861,0.002886
std,0.994398,1.018652,0.985244,1.029536
min,-3.046329,-3.290891,-3.244908,-3.380695
25%,-0.656795,-0.642763,-0.655207,-0.641044
50%,0.005968,0.039392,0.033394,0.026727
75%,0.664241,0.740526,0.658787,0.664783
max,3.056419,3.699896,3.478393,3.861149


In [32]:
#Suppose you wanted to find values in one of the columns exceeding 3 in absolute value: 
col = data[2] # first isolate/ select only the 3rd column of the data

col[np.abs(col) >3]

599   -3.728292
Name: 2, dtype: float64

In [33]:
# To select all rows that have a value exceeding 3 or -3, you can use the 'any' method...
data[(np.abs(data) >3).any(axis = 1)]

# Note: this selects THE WHOLE ROW whenever any of its values is beyond +-3

Unnamed: 0,0,1,2,3
143,-3.004345,-0.899828,0.240154,0.192567
400,3.164499,0.813197,-0.078097,-0.080511
426,3.002462,-1.29621,-0.84269,1.191065
503,-1.410977,-3.727977,0.805915,0.86597
599,-1.00986,-0.39657,-3.728292,-1.332789
626,2.153357,-1.127845,0.182386,-3.228248


In [37]:
# 'Code to cap values outside the interval -3 to 3'
data[np.abs(data) > 3] = np.sign(data) * 3  # any value beyond +-3 is replaced by exactly -3 or 3
data.describe()

## The point: capping limits the influence of extreme outliers without dropping the rows that contain them

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.045541,-0.006177,-0.034506,-0.03087
std,1.002962,0.997532,0.973474,1.001047
min,-3.0,-3.0,-3.0,-3.0
25%,-0.615337,-0.68466,-0.667265,-0.687612
50%,0.058913,0.044233,-0.006272,-0.035844
75%,0.672576,0.668767,0.597902,0.65998
max,3.0,2.834683,2.631226,2.802174
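
The same capping can also be written with DataFrame.clip, which is not used in these notes but gives the same result. A small sketch on made-up numbers (values, capped and clipped are hypothetical names):

```python
import numpy as np
import pandas as pd

values = pd.DataFrame({'a': [-4.2, 0.5, 3.7], 'b': [1.0, -3.1, 2.9]})

# Approach from the notes: overwrite entries beyond +/-3 with sign * 3.
capped = values.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# Alternative: clip every value into the interval [-3, 3] in one call.
clipped = values.clip(lower=-3, upper=3)

print(capped.equals(clipped))
```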


In [38]:
# np.sign(data) produces 1 and -1 values based on whether the values in the data are pos or negative
np.sign(data).head()

Unnamed: 0,0,1,2,3
0,-1.0,1.0,-1.0,-1.0
1,-1.0,1.0,1.0,-1.0
2,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0
4,-1.0,1.0,-1.0,-1.0


<h3> <font color = #39abed>Permutation and Random Sampling </h3>
    <p> Permuting (randomly reordering) can be done with np.random.permutation. Calling permutation with the length of the axis you want to permute creates an array of integers indicating the new reordering: (see below).
    <p> Also for offic. doc. https://numpy.org/doc/stable/reference/random/generated/numpy.random.permutation.html

In [62]:
df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
df 

#Creating the data and taking a preliminary look at the data

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [40]:
sampler = np.random.permutation(5)  # Randomly rearrange an array of 5 items
sampler  # This outputs says re-arrange items 0, 1, 2, 3, 4 like so:

array([2, 0, 3, 4, 1])

In [41]:
# The array can then be used in _iloc_ based indexing or the equivalent _take_ function
df

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [42]:
df.take(sampler) # This randomly permutated the INDEXES aka the ENTIRE ROWS of the data

# Notice, this is in the ORDER that the array was given, not numerical!

Unnamed: 0,0,1,2,3
2,8,9,10,11
0,0,1,2,3
3,12,13,14,15
4,16,17,18,19
1,4,5,6,7


### <font color = 'red'> [ ? ] So, does that mean sampler MUST match the no. of rows/indexes in the dataframe to work?
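
As far as take itself is concerned, the array does not have to cover every row: any list of valid row positions works, and a full permutation is just the case where each position appears exactly once. A sketch, with frame, subset and shuffled as made-up names and a seeded generator chosen here only for repeatability:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# Any array of valid row positions works; it need not include every row.
subset = frame.take([3, 0])

# A full permutation uses each position exactly once, so all rows come back reordered.
rng = np.random.default_rng(0)
shuffled = frame.take(rng.permutation(len(frame)))

print(subset.index.tolist())
print(sorted(shuffled.index.tolist()))
```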

In [43]:
# To select a random subset w.o. replacement, you can use _sample_ method on a Series or DF
df.sample(n = 3)  # Give me 3 random rows/indexes

Unnamed: 0,0,1,2,3
1,4,5,6,7
0,0,1,2,3
2,8,9,10,11


In [45]:
# To generate a sample WITH replacement, pass replace = True to sample. 
choices = pd.Series([5, 7, -1, 6, 4])
draws = choices.sample(n = 10, replace = True)
draws

0    5
4    4
3    6
2   -1
3    6
1    7
2   -1
2   -1
0    5
0    5
dtype: int64

<h3> <font color = #39abed>Computing Indicator/Dummy Variables </h3>
    <p> pandas has a get_dummies function that converts categorical variables into dummy or indicator variables.

In [63]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                  'data1': range(6)})

df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [64]:
pd.get_dummies(df['key']) # Because there are 3 distinct values (a, b, c), each gets its own column...
# Col a holds 1 where key is a and 0 everywhere else, col b holds 1 where key is b and 0 everywhere else, etc. 

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


In [50]:
# To make it easier to read, can add a prefix to name the different keys better.
dummies = pd.get_dummies(df['key'], prefix = 'key')
df_with_dummy = df[['data1']].join(dummies)

df_with_dummy

Unnamed: 0,data1,key_a,key_b,key_c
0,0,0,1,0
1,1,0,1,0
2,2,1,0,0
3,3,0,0,1
4,4,1,0,0
5,5,0,1,0


If a row in a DF belongs to multiple categories, it is more complicated. See example below looking at the MovieLens 1M dataset, which is investigated in more detail in Ch. 14

In [65]:
mnames = ['movie_id', 'title', 'genres']

movies = pd.read_table('C:\\Users\\Kitty\\Desktop\\learnpy\\movies.dat', sep = '::',
                       header = None, names = mnames,
                       engine = 'python')  # Backslashes in the path must be doubled (\\) or the string prefixed with r; a multi-character sep needs the python engine

movies[:10] # give me the first 10 rows of data


Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


In [66]:
# To add indicator variables, need to wrangle more.
#First, extract the list of unique genres in the dataset

all_genres = []

for x in movies.genres:
    all_genres.extend(x.split('|'))  
    
# Above- creating a for loop to go through all the genres strings and split each one on the '|' delimiter

In [67]:
genres = pd.unique(all_genres) #print all the unique cases of genres, to see how many different ones there are

genres

array(['Animation', "Children's", 'Comedy', 'Adventure', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Sci-Fi', 'Documentary', 'War', 'Musical', 'Mystery', 'Film-Noir',
       'Western'], dtype=object)

#### One way to construct the indicator DF is to start with a DF of all zeros (see code below)
<font color = 'red'>[?] Is there another way to do this?

In [68]:
# Start with a dataframe of 0, with same number of rows as the movies, and number of columns is the same as the number of each unique genre (all 18 of them).
zero_matrix = np.zeros((len(movies), len(genres))) # make a 0 matrix that has same length/dimensions as the following variables in the DF
dummies = pd.DataFrame(zero_matrix, columns = genres)

dummies

Unnamed: 0,Animation,Children's,Comedy,Adventure,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Sci-Fi,Documentary,War,Musical,Mystery,Film-Noir,Western
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3879,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3880,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Now iterate through each movie and set the entries in each row of dummies to 1... by using dummies.columns to calculate the column indices for each genre.

In [69]:
gen = movies.genres[0] # the genre of the movies in the first row
gen

# This is kind of like setting the stage/starting pt for creating the dummy variables

"Animation|Children's|Comedy"

In [70]:
gen.split('|') #split the gen variable 

['Animation', "Children's", 'Comedy']

#### <font color = 'red'>[?] WTF does this do, pg 210. pls explain

In [71]:
dummies.columns.get_indexer(gen.split('|')) 

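# get_indexer looks up each genre label in dummies.columns and returns its integer column position.
# Here 'Animation', "Children's" and 'Comedy' are columns 0, 1 and 2.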

array([0, 1, 2], dtype=int64)
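
A stripped-down sketch of what get_indexer does, using a short made-up Index instead of the full genre list. It also shows the -1 that comes back for a label that is not in the Index.

```python
import pandas as pd

columns = pd.Index(['Animation', "Children's", 'Comedy', 'Adventure'])

# Each label is translated into its integer position within the Index.
positions = columns.get_indexer(['Comedy', 'Animation'])

# A label that is not present is reported as -1.
missing = columns.get_indexer(['Western'])

print(positions.tolist())
print(missing.tolist())
```

These integer positions are what .iloc needs in order to address the right columns of dummies.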

#### <font color = 'red'>[?] WTF happened here pg 209. pls explain

In [72]:
# Next, use .iloc to set values based on these indices: for movie i, put a 1 in each of its genre columns
for i, gen in enumerate(movies.genres):
    indicies = dummies.columns.get_indexer(gen.split('|'))
    dummies.iloc[i, indicies] = 1

In [74]:
# Now combine with movies...
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.iloc[5] # The three 1s mark the genres of the movie at position 5 (the 6th row)

movie_id                                 6
title                          Heat (1995)
genres               Action|Crime|Thriller
Genre_Animation                          0
Genre_Children's                         0
Genre_Comedy                             0
Genre_Adventure                          0
Genre_Fantasy                            0
Genre_Romance                            0
Genre_Drama                              0
Genre_Action                             1
Genre_Crime                              1
Genre_Thriller                           1
Genre_Horror                             0
Genre_Sci-Fi                             0
Genre_Documentary                        0
Genre_War                                0
Genre_Musical                            0
Genre_Mystery                            0
Genre_Film-Noir                          0
Genre_Western                            0
Name: 5, dtype: object
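
An alternative to building the zero matrix by hand is the Series.str.get_dummies method, which splits each string on a separator and builds the indicator columns in one step. The sketch below uses three rows copied from the output above in a made-up frame called films, so it runs without movies.dat.

```python
import pandas as pd

films = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Heat (1995)', 'Sudden Death (1995)'],
    'genres': ["Animation|Children's|Comedy", 'Action|Crime|Thriller', 'Action'],
})

# Split every genres string on '|' and create one indicator column per genre.
genre_flags = films['genres'].str.get_dummies(sep='|').add_prefix('Genre_')

# Attach the indicator columns to the original frame.
films_with_flags = films.join(genre_flags)

print(films_with_flags.loc[1])
```

Note that the columns come out in sorted order here rather than in order of first appearance, so the layout differs slightly from the loop-based version.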

#### Another useful method for stats applications is to combine get_dummies with a discretization function like 'cut':

In [32]:
np.random.seed(12345)
values = np.random.rand(10)
values

array([0.92961609, 0.31637555, 0.18391881, 0.20456028, 0.56772503,
       0.5955447 , 0.96451452, 0.6531771 , 0.74890664, 0.65356987])

In [34]:
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))

#Set the random seed with numpy.random.seed to make the example 'deterministic'
#later, the book will explore pd.get_dummies

Unnamed: 0,"(0.0, 0.2]","(0.2, 0.4]","(0.4, 0.6]","(0.6, 0.8]","(0.8, 1.0]"
0,0,0,0,0,1
1,0,1,0,0,0
2,1,0,0,0,0
3,0,1,0,0,0
4,0,0,1,0,0
5,0,0,1,0,0
6,0,0,0,0,1
7,0,0,0,1,0
8,0,0,0,1,0
9,0,0,0,1,0
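
A small check of the `cut` + `get_dummies` combination, as a sketch: the values below are rounded copies of the first five random numbers shown above, and `dtype=int` is passed explicitly on the assumption that a newer pandas is installed (pandas 2.0 and later return booleans by default instead of 0/1).

```python
import numpy as np
import pandas as pd

# Rounded copies of the first five values from the notebook output
values = np.array([0.9296, 0.3164, 0.1839, 0.2046, 0.5677])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

# 'cut' assigns each value to an interval; get_dummies turns the intervals into columns
binned = pd.cut(values, bins)
indicators = pd.get_dummies(binned, dtype=int)

# Every value falls in exactly one bin, so each row should contain a single 1
row_sums = indicators.sum(axis=1)
print(indicators)
print(row_sums.tolist())
```

Note that a bin with no values still gets its own all-zero column, because `cut` keeps every interval as a category.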


<h2> <font color = #39abed> 7.3 String Manipulation

<h3> <font color = #39abed>String Object Methods


In [75]:
# E.g., a comma-separated string can be broken into pieces with 'split'
val = 'a,b,  guido'
val.split(',')

['a', 'b', '  guido']

In [76]:
# 'split' is often combined with 'strip' to trim whitespace, including line breaks
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [77]:
# These substrings could be concatenated with a two-colon delimiter using addition

first, second, third = pieces
first + '::' + second + '::' + third

# This works, but it is not practical or very Pythonic; 'join' (next cell) is the better tool.

'a::b::guido'

In [78]:
# However, the above is not practical...
# It is faster and more Pythonic to pass a list or tuple to the 'join' method on the string '::'
'::'.join(pieces)

'a::b::guido'

In [79]:
# Other methods deal with locating substrings
'guido' in val

True

In [80]:
val.index(',')

1

In [81]:
val.find(':') # Returns -1 if the string isn't found

-1

In [82]:
val.index(':')  # Whereas 'index' raises a ValueError when the substring isn't found (unlike 'find')

ValueError: substring not found
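
To make the `find` vs. `index` difference concrete, here is a minimal sketch of two helpers that both return `None` when the substring is missing. The helper names are made up for illustration; only `str.find` and `str.index` are real built-ins.

```python
val = 'a,b,  guido'

def position_or_none(text, sub):
    """Use 'find', which signals 'not found' with -1."""
    pos = text.find(sub)
    return None if pos == -1 else pos

def position_via_index(text, sub):
    """Use 'index', which signals 'not found' by raising ValueError."""
    try:
        return text.index(sub)
    except ValueError:
        return None

print(position_or_none(val, ','))    # position of the first comma
print(position_or_none(val, ':'))    # None, no colon in val
print(position_via_index(val, ':'))  # None, the ValueError was caught
```

Which one to use is mostly a matter of taste: `find` suits a quick check, while `index` suits code where a missing substring really is an error.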

In [83]:
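# 'count' returns the number of occurrences of a particular substring.
# Note: counting the empty string '' returns len(val) + 1 (12 here);
# val.count(',') would return 2, the number of commas.
val.count('')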

12

In [84]:
# 'replace' substitutes occurrences of one pattern with the one you tell it to.
# Commonly used to delete patterns too, by just passing an empty string

val.replace(',', '::') # For every ',' replace it with '::'

'a::b::  guido'

In [85]:
val.replace(',', '') # For every ',' DELETE by replacing it with nothing

'ab  guido'

### See Table 7.3 on pg 213 for more of Python's built-in string methods

<h3> <font color = #39abed>Regular Expressions </h3>
<p>A way to search or match string patterns in text.
A single expression is called a 'regex', and is a string formed according to the regular expression language.
<p> Python has the built-in 're' module, and its functions fall into 3 categories: <b> pattern matching, substitution, and splitting.</b>


In [86]:
# Ex. We want to split a string on a variable number of whitespace characters.
# In regex, one or more whitespace characters is written '\s+'.
# Whitespace includes: tabs, spaces, and newlines.

import re

In [87]:
text = "foo         bar\t baz   \tqux"
text

'foo         bar\t baz   \tqux'

In [88]:
re.split(r'\s+', text)  # Split 'text' on runs of whitespace (raw string, so '\s' is not read as a string escape)

['foo', 'bar', 'baz', 'qux']

In [89]:
# When re.split is called, the regex is first *compiled*, and then split is called on the text.
# You can compile a regex yourself with re.compile to make a reusable regex object.

regex = re.compile(r'\s+')
regex.split(text)

# Why compile first? If the same expression is applied to many strings,
# compiling it once up front saves CPU cycles.

['foo', 'bar', 'baz', 'qux']
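
A minimal sketch of why a compiled regex is handy: compile once, then reuse the same object on many strings. The list `lines` is made-up sample input, not data from the book.

```python
import re

# Compile the whitespace pattern once...
whitespace = re.compile(r'\s+')

# ...then reuse the same compiled object on several strings (hypothetical sample input)
lines = ['foo    bar\t baz', 'one  two', 'a\tb \t c']
split_lines = [whitespace.split(line) for line in lines]

print(split_lines)
```

Python's `re` module also caches recently used patterns internally, so the speed gain in small scripts may be modest; the reusable object is arguably just as much about readability.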

In [90]:
# If you want to get a list of all patterns matching the regex, use findall.

regex.findall(text)  # Show me a list of all the whitespace in 'text'

['         ', '\t ', '   \t']

### 'match' and 'search' are similar to 'findall' 
<p>'findall' returns ALL matches in a string.<br>
'search' returns only the FIRST match.<br>
'match' ONLY matches at the beginning of the string.<br>
<p> See below for an example with a block of text and a regex capable of identifying most email addresses.

In [91]:
# JUST AN EXAMPLE GIVEN IN THE BOOK, pg 214
text = """Dave dave@google.com 
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com 
""" 
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'  # username, then '@', then domain, then '.' and a 2-4 letter suffix

# re.IGNORECASE makes the regex case-insensitive 
regex = re.compile(pattern, flags=re.IGNORECASE) 

In [92]:
regex.findall(text) 

# This applies the previously compiled 'regex' to the text and returns only the email addresses, not the names

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [93]:
# 'search' returns a special match object for the first email address only
m = regex.search(text)
m

# The match object only tells us the start and end position of the match in the string (its span)

<re.Match object; span=(5, 20), match='dave@google.com'>

In [94]:
text[m.start():m.end()]  # Slice 'text' from the start to the end of the match 'm' to recover the matched string

'dave@google.com'

In [95]:
# regex.match returns None, as it will only match if the pattern occurs at the START of the string
print(regex.match(text))

None


In [96]:
# Similarly, 'sub' returns a new string with occurrences of the pattern replaced by a new string
print(regex.sub('REDACTED', text))

# I.e., every substring that matches the previously compiled 'regex' is replaced with the given value, here REDACTED

Dave REDACTED 
Steve REDACTED
Rob REDACTED
Ryan REDACTED 



<blockquote> Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment: <br>
    <b> pg. 215</b></blockquote>

In [97]:
# Step 1 - COPIED FROM BOOK
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# This first defines what we want from the example text above:
# the parentheses segment each email into 3 components.

In [28]:
# Step 2
regex = re.compile(pattern, flags = re.IGNORECASE)

# Now we compile the pattern into a regex; IGNORECASE means upper or lower case doesn't matter

In [29]:
# Step 3 -
# The 'groups' method of a match object produced by this regex returns a tuple of the pattern components
m = regex.match('wesm@bright.net')  # instead of our 'text' block, we apply the regex to this single email address
m.groups()

# Putting it all together gives the three components we originally wanted in the example

('wesm', 'bright', 'net')

In [30]:
# 'findall' returns a list of tuples when the pattern has groups.
regex.findall(text)  # now apply the above to our previous "text" with all other emails

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [31]:
# 'sub' also has access to the groups in each match using special symbols
# like \1 (first matched group), \2 (second matched group), etc.

print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

# The above says: for each match, substitute the matched groups into the labelled template.
# E.g., the 1st group is written after 'Username:', etc.

Dave Username: dave, Domain: google, Suffix: com 
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com 
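
As a variation on the numbered groups above, Python's `re` also supports named groups with the `(?P<name>...)` syntax. This is a sketch only; the group names `username`, `domain`, and `suffix` are chosen here for illustration.

```python
import re

# Same pattern as above, but each group gets a name (names are illustrative)
named_pattern = (r'(?P<username>[A-Z0-9._%+-]+)'
                 r'@(?P<domain>[A-Z0-9.-]+)'
                 r'\.(?P<suffix>[A-Z]{2,4})')
named_regex = re.compile(named_pattern, flags=re.IGNORECASE)

m = named_regex.match('wesm@bright.net')

# groupdict() returns the matched groups keyed by their names
parts = m.groupdict()
print(parts)
```

Named groups can make the result easier to read than remembering which tuple position holds which component.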



#### See Table 7.4 on pg. 216 for more regex methods

<h3> <font color = #39abed>Vectorized String Functions in pandas</h3>
<p>(e.g., for when having to clean data, and some strings have missing data)

In [34]:
data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [35]:
data.isnull() # Check if/where is there missing data

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

String and regex methods can be applied to each value (by passing a lambda or other function) using 'data.map', <b>but it will fail on the NA / null values</b>.
<p>To deal with this, Series has array-oriented methods for string operations that <u>skip NA values</u>! These are accessed via the Series's 'str' attribute.
<br>E.g., we can check whether each email address has 'gmail' in it with 'str.contains' (see below)

In [64]:
data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
       'Rob': 'rob@gmail.com', 'Wes': np.nan})

In [65]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
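
To see the difference between `map` and the `str` methods on missing data, here is a small sketch. The Series is rebuilt under the name `emails` (a name chosen here so it does not overwrite `data`), and `na=False` is an optional `str.contains` argument that fills missing entries with `False`.

```python
import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                    'Rob': 'rob@gmail.com', 'Wes': np.nan}, dtype=object)

# A plain lambda breaks on the missing value, because NaN is a float, not a string
try:
    emails.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    map_failed = True

# The vectorized 'str' method skips the NaN; na=False fills it with False instead
has_gmail = emails.str.contains('gmail', na=False)

print(map_failed)
print(has_gmail)
```

Filling with `False` is convenient when the result is used for boolean indexing, since a mask containing NaN cannot be used directly.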

In [71]:
# Regex can be used too, along with any 're' options like IGNORECASE
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [72]:
data.str.findall(pattern, flags = re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

In [73]:
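# There are a few ways to do *vectorised element retrieval*:
# either use 'str.get' or index into the 'str' attribute.
# Note that 'str.match' here returns booleans (does the value match?), not the matched groups.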
matches = data.str.match(pattern, flags = re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object

In [70]:
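# To access elements in the embedded lists, can pass an index to either of these functions
matches.str.get(1)

## This raises an error because 'matches' holds booleans (True/NaN), not strings or lists,
## so the .str accessor has nothing to index into.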

AttributeError: Can only use .str accessor with string values!
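
The AttributeError above arises because `str.match` gives back booleans. One way to still get at the pieces, sketched below, is to index into the lists returned by `str.findall`, or to use `str.extract`, which returns one column per group. The Series is rebuilt as `emails` here (a name chosen for illustration) so the block runs on its own.

```python
import re
import numpy as np
import pandas as pd

emails = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                    'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# findall gives a list of tuples per row, so .str indexing works on the result
found = emails.str.findall(pattern, flags=re.IGNORECASE)
first_match = found.str[0]        # first (username, domain, suffix) tuple per row
domains = first_match.str.get(1)  # second element of each tuple

# extract returns a DataFrame with one column per group (columns 0, 1, 2)
parts = emails.str.extract(pattern, flags=re.IGNORECASE)

print(domains)
print(parts)
```

In both cases the missing entry for 'Wes' stays NaN rather than causing an error.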

In [45]:
str(matches)

'Dave     True\nSteve    True\nRob      True\nWes       NaN\ndtype: object'

In [None]:
matches.str[0]  # Fails with the same AttributeError, since 'matches' holds booleans

In [47]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object

#### See table 7.5 for more pandas string methods

<h2> <font color = #39abed> 7.4 Conclusion

EFFECTIVE DATA PREP CAN SIGNIFICANTLY IMPROVE PRODUCTIVITY, ALLOWING FOR MORE TIME TO BE SPENT ON ANALYSING THE DATA AND LESS TIME GETTING IT READY FOR ANALYSIS.