Readme:


We encourage you to explore further functionality in 'Python for Data Analysis, 3E' by Wes McKinney, Chapter 7: 'Data Cleaning and Preparation'.</br>
Link: https://wesmckinney.com/book/data-cleaning

In [1]:
import pandas as pd
import numpy as np

<p>
Handling Missing Data. </br> </br>

1. Create a Series from the list ["aardvark", np.nan, None, "avocado"] and observe how the values 'np.nan' and 'None' are represented.  </br>
2. What is the data type of the Series? </br>
3. Then identify which values are considered 'NA' (not available) by calling the isna() method. </br>

</p>


In [23]:
s = pd.Series(["aardvark", np.nan, None, "avocado"]) # None is represented as None with object dtype
print(s)
print(s.isna())


0    aardvark
1         NaN
2        None
3     avocado
dtype: object
0    False
1     True
2     True
3    False
dtype: bool


<p>
For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.</br>
Create a Series from the list [1, 2, None], specifying the data type as 'float64'. How is the 'None' value represented here? Compare it with the 'None' representation in the previous task. </br>
</p>


In [30]:
s = pd.Series([1, 2, None], dtype='float64') # None is represented as NaN with float64 dtype
s


0    1.0
1    2.0
2    NaN
dtype: float64
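
<p>
A minimal sketch of why isna() is used to locate missing values rather than an equality test: NaN does not compare equal to itself. </br>
The variable names 'eq_mask' and 'na_mask' are only illustrative. </br>
</p>

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="float64")

# Comparing with == never finds NaN, because NaN is not equal to itself
eq_mask = s == np.nan
# isna() is the reliable way to locate missing values
na_mask = s.isna()

print(eq_mask.tolist())
print(na_mask.tolist())
```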

<p>
Create a Series from list [1, np.nan, 3.5, np.nan, 7] and drop the missing values. </br>
</p>


In [31]:
s = pd.Series([1, np.nan, 3.5, np.nan, 7])
print(s)
print(s.dropna())

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
0    1.0
2    3.5
4    7.0
dtype: float64
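
<p>
As a small sketch, the same result can be obtained with Boolean indexing on notna(). </br>
The names 'filtered' and 'dropped' are illustrative; neither call modifies the original Series. </br>
</p>

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3.5, np.nan, 7])

# Boolean indexing with notna() keeps the same rows as dropna()
filtered = s[s.notna()]
dropped = s.dropna()

print(filtered.equals(dropped))
# The original Series still holds all five values
print(len(s), len(dropped))
```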


<p>
1. Create a dataframe from the nested list [[1., 6.5, 3.], [1., np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]].</br>
2. Run the dropna() method. Does it drop every row containing at least one missing value, or only rows where all values are missing? </br>
</p>


In [38]:
df = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])
df.dropna()

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


<p>
Change the dropna() call so that it drops only the rows where ALL values are missing.</br>
</p>


In [33]:
df.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


<p>
1. Add a column labeled 3 in which all values are NA. </br>
2. Drop the COLUMN where ALL values are NA.
</p>


In [40]:
df[3] = np.nan 
df.dropna(how='all', axis=1)

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0
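
<p>
Between 'any' and 'all' there is a middle ground: the 'thresh' argument of dropna() keeps only rows with at least that many non-NA values. </br>
A short sketch on the same nested list; the name 'kept' is illustrative. </br>
</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                   [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Keep only rows that have at least 2 non-NA values
kept = df.dropna(thresh=2)
print(kept)
```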


<p>
1. Create a dataframe of random floats with shape (7, 3). </br>
2. Assign NA to rows 0 through 3 (inclusive) of column 1. </br>
3. Assign NA to rows 0 through 1 (inclusive) of column 2. </br>
4. Fill NA values with 0. </br>
5. Using a dictionary, fill NA values with 1 in column 1 and with 2 in column 2. </br>

</p>


In [52]:
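df = pd.DataFrame(np.random.standard_normal((7, 3)))
df.iloc[:4, 1] = np.nan  # rows 0-3 of column 1
df.iloc[:2, 2] = np.nan  # rows 0-1 of column 2
df.fillna(0)  # returns a new DataFrame; not displayed, since a cell shows only its last expression

d = {1: 1, 2: 2}  # column label -> fill value
df.fillna(d)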



Unnamed: 0,0,1,2
0,0.622368,1.0,2.0
1,0.713091,1.0,2.0
2,-1.090831,1.0,-0.677844
3,-0.822798,1.0,0.163402
4,0.881934,0.197549,0.609213
5,-0.786602,-0.590282,0.945367
6,0.155143,1.056522,-1.210088
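
<p>
Besides constants, NA values can be filled from the data itself. </br>
The sketch below uses a small hand-written frame (an assumption, chosen so the result is reproducible) and shows filling with column means and forward filling. </br>
</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# Fill each column's NA values with that column's mean
filled_mean = df.fillna(df.mean())
# Propagate the last valid observation forward; a leading NA stays NA
filled_ffill = df.ffill()

print(filled_mean)
print(filled_ffill)
```

Both calls return new objects, so 'df' itself still contains its NA values afterwards.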


<p>
1. Run the code below and display the result. </br>
2. Return a Boolean Series indicating whether each row is a duplicate of an earlier row. </br>
3. Return a dataframe with the duplicated rows dropped. </br>
4. Return a dataframe where rows are dropped only if they duplicate an earlier value in column k2. </br>

</p>


In [56]:
df = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                         "k2": [1, 1, 2, 3, 3, 4, 4]})
print(df)

print(df.duplicated())

print(df.drop_duplicates())

print(df.drop_duplicates(subset='k2'))

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
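
<p>
By default, duplicated() and drop_duplicates() keep the first occurrence. </br>
A brief sketch of the 'keep' argument on the same data; the names 'last_kept' and 'all_dupes' are illustrative. </br>
</p>

```python
import pandas as pd

df = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                   "k2": [1, 1, 2, 3, 3, 4, 4]})

# keep="last" retains the final occurrence of each k2 value instead of the first
last_kept = df.drop_duplicates(subset="k2", keep="last")
# keep=False flags every member of a duplicated group, not just the repeats
all_dupes = df.duplicated(keep=False)

print(last_kept)
print(all_dupes)
```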


<p>
Add a new column called 'animal' to the dataframe below by mapping meat_to_animal onto the 'food' column. </br>
</p>


In [58]:
df = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
                               "pastrami", "corned beef", "bacon",
                               "pastrami", "honey ham", "nova lox"],
                      "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_to_animal = {
  "bacon": "pig",
  "pulled pork": "pig",
  "pastrami": "cow",
  "corned beef": "cow",
  "honey ham": "pig",
  "nova lox": "salmon"
}

df['animal'] = df['food'].map(meat_to_animal)
df

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


<p>
Tip: </br>
- 'map' works element-wise on a Series; </br>
- 'apply' works on a row / column basis of a DataFrame; </br>
- 'applymap' works element-wise on a DataFrame (deprecated since pandas 2.1 in favor of 'DataFrame.map'). </br> </br>

We could achieve the same result by mapping the function below onto the 'food' column - run the code and analyze the result. </br>

</p>


In [59]:
def get_animal(x):
    return meat_to_animal[x]

df['animal'] = df.food.map(get_animal)
df

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
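
<p>
The two approaches differ when a value has no entry in the mapping. </br>
The sketch below uses a made-up value, "tofu", and a shortened dictionary (both assumptions for illustration): mapping with the dictionary yields NaN, while a function that indexes the dictionary raises KeyError. </br>
</p>

```python
import pandas as pd

food = pd.Series(["bacon", "tofu"])   # "tofu" is deliberately absent from the mapping
meat_to_animal = {"bacon": "pig"}

# Mapping with a dict: values without a key become NaN
mapped = food.map(meat_to_animal)
print(mapped)

# Mapping with a function that indexes the dict: a missing key raises KeyError
try:
    food.map(lambda x: meat_to_animal[x])
except KeyError as exc:
    print("KeyError for:", exc)

# dict.get with a default avoids the error
safe = food.map(lambda x: meat_to_animal.get(x, "unknown"))
print(safe)
```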


<p>
As you've already seen, 'map' can be used to modify a subset of values in an object, but 'replace' provides a simpler and more flexible way to do so. </br>
Given the Series below, replace the value -999 with 0 and the value -1000 with np.nan using the replace() method.</br>

</p>


In [61]:
s = pd.Series([1., -999., 2., -999., -1000., 3.])
s.replace([-999, -1000], [0, np.nan])


0    1.0
1    0.0
2    2.0
3    0.0
4    NaN
5    3.0
dtype: float64
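
<p>
The same replacement can also be written with a dictionary, which keeps each old value next to its new value. </br>
A minimal sketch; the names 'with_dict' and 'with_lists' are illustrative. </br>
</p>

```python
import numpy as np
import pandas as pd

s = pd.Series([1., -999., 2., -999., -1000., 3.])

# Dictionary form: {old_value: new_value}
with_dict = s.replace({-999: 0, -1000: np.nan})
# List form used in the exercise, for comparison
with_lists = s.replace([-999, -1000], [0, np.nan])

print(with_dict.equals(with_lists))
```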

<p>
Like values in a Series, axis labels can be transformed by a function or mapping of some form to produce new, differently labeled objects.  </br>
1. Given the dataframe below, create and map a custom function that converts the index values to uppercase. </br>
2. Modify the dataframe in place by assigning the new index to it. </br>

</p>


In [62]:
df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                     index=["Ohio", "Colorado", "New York"],
                     columns=["one", "two", "three", "four"])
def transform(x):
    return x.upper()
df.index = df.index.map(transform)

<p>
If you want to create a transformed version of a dataset without modifying the original, a useful method is 'rename'. </br>
Create a transformed version of the above dataframe by using the rename() method to convert all column names to uppercase. </br>

</p>


In [64]:
df2 = df.rename(columns=str.upper)
df2

Unnamed: 0,ONE,TWO,THREE,FOUR
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


<p>
Now rename the above dataframe so that the index label "COLORADO" becomes "FOO" and the column "two" becomes "bar". </br>
</p>


In [65]:
df.rename(index={"COLORADO": "FOO"}, columns={"two": "bar"})

Unnamed: 0,one,bar,three,four
OHIO,0,1,2,3
FOO,4,5,6,7
NEW YORK,8,9,10,11
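
<p>
A short sketch confirming that rename() returns a new object and leaves the original labels untouched. </br>
The dataframe is rebuilt here with the uppercase index so the block runs on its own; 'renamed' and 'titled' are illustrative names. </br>
</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=["OHIO", "COLORADO", "NEW YORK"],
                  columns=["one", "two", "three", "four"])

# Dictionaries rename only the listed labels
renamed = df.rename(index={"COLORADO": "FOO"}, columns={"two": "bar"})
# Functions are applied to every label on the given axis
titled = df.rename(index=str.title)

print(list(renamed.index), list(renamed.columns))
print(list(titled.index))
# The original dataframe keeps its labels
print(list(df.index), list(df.columns))
```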


Regular Expressions.</br>
Run the code below and analyze the result.

In [69]:
import re
text = "foo    bar\t baz  \tqux"
re.split(r"\s+", text)

['foo', 'bar', 'baz', 'qux']

When you call re.split(r"\s+", text), the regular expression is first compiled, and then its split method is called on the passed text. </br>
1. Now compile the regex yourself with re.compile, forming a reusable regex object.</br>
2. Split the 'text' string using the compiled regex object's split method.</br>
3. Now get a list of all substrings matching the compiled regex, using the findall method.</br></br>

*Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.</br>
*'match' and 'search' are closely related to 'findall'. While 'findall' returns all matches in a string, 'search' returns only the first match. More rigidly, 'match' only matches at the beginning of the string. 

In [71]:
my_re = re.compile(r"\s+")
print(my_re.split(text))
print(my_re.findall(text))

['foo', 'bar', 'baz', 'qux']
['    ', '\t ', '  \t']
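
A minimal sketch contrasting 'search', 'match' and 'findall' on the same text, as described in the notes above. The variable names are illustrative.

```python
import re

text = "foo    bar\t baz  \tqux"
regex = re.compile(r"\s+")

# search returns a match object for the first run of whitespace only
first = regex.search(text)
print(first.start(), first.end())

# match only succeeds at the very beginning of the string, so here it returns None
at_start = regex.match(text)
print(at_start)

# findall returns every run of whitespace
all_ws = regex.findall(text)
print(all_ws)
```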


<p>
String Functions in pandas.</br>
String and regular expression methods can be applied (passing a lambda or other function) to each value using data.map, but it will fail on the NA (null) values!</br>
To cope with this, Series has array-oriented methods for string operations that skip over and propagate NA values. </br>
These are accessed through Series’s 'str' attribute.</br>
For example, we could check whether each email address has "gmail" in it with str.contains() as shown below. </br>
</p>


In [72]:
data = {"Dave": "dave@google.com", "Steve": "steve@gmail.com",
         "Rob": "rob@gmail.com", "Wes": np.nan}
s = pd.Series(data) 
s.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
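
<p>
A sketch of the difference described above: a plain Python function applied with map also receives the NA value and fails on it, while the 'str' methods handle it. </br>
Passing na=False to str.contains() is one way (not the only one) to get a pure Boolean mask that can be used for filtering. </br>
</p>

```python
import numpy as np
import pandas as pd

s = pd.Series({"Dave": "dave@google.com", "Steve": "steve@gmail.com",
               "Rob": "rob@gmail.com", "Wes": np.nan})

# The lambda is also called on the NaN value, where the 'in' test is not defined
try:
    s.map(lambda x: "gmail" in x)
except TypeError as exc:
    print("map failed:", exc)

# na=False treats missing values as non-matches
mask = s.str.contains("gmail", na=False)
print(s[mask])
```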

<p>
Note that the result of this operation has an object dtype. </br>
pandas has extension types that provide specialized treatment of strings, integers, and Boolean data.</br>
Arrays of the 'string' extension type generally use much less memory and are frequently more computationally efficient for operations on large datasets. </br>
Run the code below and pay attention to the dtypes.</br>
</p>


In [75]:
data_as_string_ext = s.astype('string') 
print(s.dtype)
print(data_as_string_ext.dtype)
print(data_as_string_ext.str.contains('gmail'))

object
string
Dave     False
Steve     True
Rob       True
Wes       <NA>
dtype: boolean


<p>
Regular expressions can be used, too, along with any re options like IGNORECASE. </br>
Analyze the pattern below, run the code and pay attention to the syntax. </br>

</p>


In [77]:
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
s.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object

<p>
There are a couple of ways to do vectorized element retrieval: either use str.get() or index into the 'str' attribute (for example str[0] or str[:5]).</br>
Run the code below and analyze the result. </br>
</p>


In [83]:
matches = s.str.findall(pattern, flags=re.IGNORECASE).str[0].str.get(1)
print(matches)
print(s.str[:5])

Dave     google
Steve     gmail
Rob       gmail
Wes         NaN
dtype: object
Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object
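
<p>
A small sketch showing that str.get(i) and str[i] retrieve the same element from each value. </br>
The comma-separated sample data is an assumption made up for illustration. </br>
</p>

```python
import numpy as np
import pandas as pd

# Each non-missing value becomes a list after splitting
parts = pd.Series(["a,b,c", "d,e", np.nan]).str.split(",")

second_get = parts.str.get(1)   # element 1 via str.get
second_idx = parts.str[1]       # element 1 via indexing
third = parts.str.get(2)        # positions that do not exist give NaN

print(second_get.equals(second_idx))
print(third)
```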


<p>
The str.extract() method will return the captured groups of a regular expression as a DataFrame.</br>
Run the code below and analyze the result. </br>
</p>


In [85]:
s.str.extract(pattern, flags = re.IGNORECASE)

Unnamed: 0,0,1,2
Dave,dave,google,com
Steve,steve,gmail,com
Rob,rob,gmail,com
Wes,,,


<p>
For further reading:</br>
1. Extension data types. </br>
2. Categorical data type.</br>
..............
</p>


<p>
Condition </br>
</p>
