
In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skimage.io import imread
from numpy import nan as NA

During data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst’s time. Sometimes the way data is stored in files or databases is not the right format for a particular task.

Fortunately, pandas, along with the built-in Python language features, provides you with a high-level, flexible, and fast set of tools to enable you to manipulate data into the right form.

In [50]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [51]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

# NOTE: The built-in Python None value is also treated as NA in object arrays

In [52]:
string_data[0]= None

In [53]:
string_data

0         None
1    artichoke
2          NaN
3      avocado
dtype: object

In [54]:
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
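
Because isnull returns a boolean Series, it can also be used to count missing entries or to select the ones that are present. The following is a minimal sketch; the variable names `n_missing` and `present_values` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the Series above after the None assignment
string_data = pd.Series([None, 'artichoke', np.nan, 'avocado'])

# isnull() marks both None and NaN as missing; notnull() is its negation
missing_mask = string_data.isnull()
n_missing = int(missing_mask.sum())
present_values = string_data[string_data.notnull()].tolist()

print(n_missing)        # 2
print(present_values)   # ['artichoke', 'avocado']
```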

# Filtering out missing data

In [55]:
data = pd.Series([1,NA,3.5,NA,5])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    5.0
dtype: float64

In [56]:
data.dropna()

0    1.0
2    3.5
4    5.0
dtype: float64
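
On a Series, dropna gives the same result as boolean indexing with notnull. A short sketch of the equivalence, using the same values as above:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 5])

# Boolean indexing with notnull() keeps the same entries as dropna()
via_mask = data[data.notnull()]
via_dropna = data.dropna()

print(via_mask.equals(via_dropna))  # True
print(via_mask.index.tolist())      # [0, 2, 4]
```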

# You may want to drop rows or columns that are all NA or only those containing any NAs. dropna by default drops any row containing a missing value:

In [57]:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [58]:
cleaned  = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


Passing how='all' drops only the rows in which every value is NA.


In [59]:
# how='all' drops only the rows in which every value is NA
data.dropna(how='all')

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


# To drop columns in the same way, pass axis=1

In [60]:
data[4]=NA
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [61]:
data.dropna(how='all', axis=1)

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


# A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument:

In [62]:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 2] = NA  # set the first four rows of column 2 to NA
df

Unnamed: 0,0,1,2
0,-0.88034,0.525394,
1,-0.219627,-0.649197,
2,-0.97302,-0.431282,
3,-0.632329,-1.107158,
4,-0.138062,-0.190705,0.109021
5,-0.152433,1.072356,0.873093
6,0.020861,0.335139,1.16535


In [63]:
df.dropna()

Unnamed: 0,0,1,2
4,-0.138062,-0.190705,0.109021
5,-0.152433,1.072356,0.873093
6,0.020861,0.335139,1.16535


In [64]:
df.dropna(thresh=2)

Unnamed: 0,0,1,2
0,-0.88034,0.525394,
1,-0.219627,-0.649197,
2,-0.97302,-0.431282,
3,-0.632329,-1.107158,
4,-0.138062,-0.190705,0.109021
5,-0.152433,1.072356,0.873093
6,0.020861,0.335139,1.16535
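
In the frame above every row has at least two observed values, so thresh=2 keeps all of them. The sketch below uses a small frame with hypothetical fixed values (not the random data above) so that the effect of different thresholds is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values so the result is reproducible
frame = pd.DataFrame([[1.0, 2.0, 3.0],
                      [4.0, np.nan, 6.0],
                      [7.0, np.nan, np.nan],
                      [np.nan, np.nan, np.nan]])

# thresh=N keeps rows that have at least N non-NA values
at_least_two = frame.dropna(thresh=2)
at_least_three = frame.dropna(thresh=3)

print(at_least_two.index.tolist())    # [0, 1]
print(at_least_three.index.tolist())  # [0]
```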


# Filling missing values
Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:

In [65]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.88034,0.525394,0.0
1,-0.219627,-0.649197,0.0
2,-0.97302,-0.431282,0.0
3,-0.632329,-1.107158,0.0
4,-0.138062,-0.190705,0.109021
5,-0.152433,1.072356,0.873093
6,0.020861,0.335139,1.16535


# Calling fillna with a dict, you can use a different fill value for each column

In [66]:
df.fillna({1:0.5, 2:1})

Unnamed: 0,0,1,2
0,-0.88034,0.525394,1.0
1,-0.219627,-0.649197,1.0
2,-0.97302,-0.431282,1.0
3,-0.632329,-1.107158,1.0
4,-0.138062,-0.190705,0.109021
5,-0.152433,1.072356,0.873093
6,0.020861,0.335139,1.16535


# NOTE: fillna returns a new object, but you can modify the existing object in-place:

In [67]:
# inplace=True modifies df directly and returns None, so the result is not assigned
df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,-0.88034,0.525394,0.0
1,-0.219627,-0.649197,0.0
2,-0.97302,-0.431282,0.0
3,-0.632329,-1.107158,0.0
4,-0.138062,-0.190705,0.109021
5,-0.152433,1.072356,0.873093
6,0.020861,0.335139,1.16535


# The same interpolation methods available for reindexing can be used with fillna

In [73]:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.210306,0.910365,-0.601312
1,-0.602998,0.029004,0.540567
2,1.547826,,-0.39595
3,0.380266,,-1.338918
4,-0.85624,,
5,-1.411339,,


In [74]:
df.fillna(method='ffill')

Unnamed: 0,0,1,2
0,-0.210306,0.910365,-0.601312
1,-0.602998,0.029004,0.540567
2,1.547826,0.029004,-0.39595
3,0.380266,0.029004,-1.338918
4,-0.85624,0.029004,-1.338918
5,-1.411339,0.029004,-1.338918


In [75]:
df.fillna(method='ffill', limit = 2)

Unnamed: 0,0,1,2
0,-0.210306,0.910365,-0.601312
1,-0.602998,0.029004,0.540567
2,1.547826,0.029004,-0.39595
3,0.380266,0.029004,-1.338918
4,-0.85624,,-1.338918
5,-1.411339,,-1.338918
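
Newer pandas releases deprecate the method argument of fillna in favor of the dedicated ffill and bfill methods, which accept the same limit argument. A minimal sketch, assuming a small Series with hypothetical fixed values in place of the random data above:

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values in place of the random data above
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# ffill() carries the last valid value forward
filled_all = s.ffill()
# limit caps how many consecutive NA values are filled
filled_limited = s.ffill(limit=2)

print(filled_all.tolist())                   # [1.0, 1.0, 1.0, 1.0, 5.0]
print(int(filled_limited.isnull().sum()))    # 1
```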


# fillna allows many creative approaches. For example, you might pass the mean or median value of a Series:

In [78]:
data = pd.Series([1,NA,2.5,3,NA])
data.fillna(data.mean())


0    1.000000
1    2.166667
2    2.500000
3    3.000000
4    2.166667
dtype: float64
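
The same idea extends to a DataFrame: passing a Series of column means fills each column with its own mean. The frame and its column names 'a' and 'b' below are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with a gap in each column
frame = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                      'b': [np.nan, 10.0, 20.0]})

# frame.mean() is a Series indexed by column name, so each column
# is filled with its own mean (NA values are skipped when averaging)
filled = frame.fillna(frame.mean())

print(filled['a'].tolist())  # [1.0, 2.0, 3.0]
print(filled['b'].tolist())  # [15.0, 10.0, 20.0]
```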

# Data Transformation

In [84]:
# Removing Duplicates
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
 'k2': [1, 1, 2, 3, 3, 4, 4]})

data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


# The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [85]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

# Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False:

In [88]:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


# Both of these methods by default consider all of the columns; alternatively, you can specify any subset of them to detect duplicates. Suppose we had an additional column of values and wanted to filter duplicates only based on the 'k1' column:

In [89]:
data['v1']=range(7)
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [91]:
data.drop_duplicates(['k1', 'k2'], keep='last')

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
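
By default duplicated and drop_duplicates keep the first observed combination; keep='last' keeps the last one instead, and keep=False drops every row that has a duplicate. A short sketch comparing the three options on the same data:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4],
                     'v1': range(7)})

# Rows 5 and 6 share the same (k1, k2) pair: ('two', 4)
keep_first = data.drop_duplicates(['k1', 'k2'])  # default keep='first'
keep_last = data.drop_duplicates(['k1', 'k2'], keep='last')
keep_none = data.drop_duplicates(['k1', 'k2'], keep=False)

print(keep_first.index.tolist())  # [0, 1, 2, 3, 4, 5]
print(keep_last.index.tolist())   # [0, 1, 2, 3, 4, 6]
print(keep_none.index.tolist())   # [0, 1, 2, 3, 4]
```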


# Transforming Data Using a Function or Mapping

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following hypothetical data collected about various kinds of meat:

In [92]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:

In [93]:
meat_to_animal = {
 'bacon': 'pig',
 'pulled pork': 'pig',
 'pastrami': 'cow',
 'corned beef': 'cow',
 'honey ham': 'pig',
 'nova lox': 'salmon'
}
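
One way to apply this mapping is the Series map method, which accepts a dict. Some of the food names are capitalized while the dict keys are all lowercase, so this sketch lowercases the values first. The column name 'animal' is an assumption chosen for illustration.

```python
import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

# Lowercase first so that 'Pastrami' and 'Bacon' match the dict keys,
# then look each value up in the mapping
data['animal'] = data['food'].str.lower().map(meat_to_animal)

print(data['animal'].tolist())
# ['pig', 'pig', 'pig', 'cow', 'cow', 'pig', 'cow', 'pig', 'salmon']
```

Without the str.lower() step, the capitalized entries would have no matching key and would map to NA.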