**Handling NaN and String Data in pandas**

In [1]:
import pandas as pd
import numpy as np

In [2]:
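# iris.data has no header row, so read_csv uses the first record (5.1,3.5,1.4,0.2,Iris-setosa)
# as column names; pass header=None to keep that record as data.
iris=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')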
iris.head()

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa


In [3]:
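# Work on a copy and give the columns short names:
# sepal length, sepal width, petal length, petal width, flower type
df=iris.copy()
df.columns=["sl","sw","pl","pw","flower_type"]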

In [4]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa


***For NaN entries we have two options:***

**1. Fill the entries.**

**2. Remove them.**
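Before choosing between the two options, it helps to know how many entries are missing and where. A minimal sketch, assuming a small hypothetical frame named `demo` that stands in for the iris data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the iris frame used in this notebook.
demo = pd.DataFrame({
    "sl": [4.9, 4.7, 4.6, 5.0],
    "pl": [1.4, np.nan, np.nan, 1.4],
    "flower_type": ["Iris-setosa", "Iris-setosa", None, "Iris-setosa"],
})

# isna() marks each missing cell as True; summing counts them per column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value.
rows_with_missing = demo[demo.isna().any(axis=1)]
print(rows_with_missing)
```

Both `np.nan` and `None` are treated as missing by `isna()`.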

In [5]:
# Set rows 2-3 of columns pl, pw and flower_type to NaN (iloc slices exclude the end index)
df.iloc[2:4,2:5]=np.nan

In [6]:
df.head(4)
df.describe()

Unnamed: 0,sl,sw,pl,pw
count,149.0,149.0,147.0,147.0
mean,5.848322,3.051007,3.806122,1.219048
std,0.828594,0.433499,1.750351,0.757278
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.4,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


**1. Dropping NaN Entries**

In [7]:
df.dropna(inplace=True)

In [8]:
df.head()

Unnamed: 0,sl,sw,pl,pw,flower_type
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
5,4.6,3.4,1.4,0.3,Iris-setosa
6,5.0,3.4,1.5,0.2,Iris-setosa


In [9]:
df.reset_index(drop=True,inplace=True)

In [11]:
df.head()
df.shape

(147, 5)
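The cells above drop rows in place and then rebuild the index. As a rough sketch of some other ways to call `dropna()`, again on a hypothetical `demo` frame rather than the notebook's `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two columns.
demo = pd.DataFrame({
    "sl": [4.9, 4.7, 4.6, 5.0],
    "sw": [3.0, np.nan, 3.1, 3.6],
    "pl": [1.4, 1.3, np.nan, np.nan],
})

# Without inplace=True, dropna() returns a new frame and leaves demo untouched.
no_missing = demo.dropna().reset_index(drop=True)

# subset= only checks the listed columns, so rows missing just "pl" are kept.
sw_present = demo.dropna(subset=["sw"])

# axis=1 drops columns that contain missing values instead of rows.
complete_columns = demo.dropna(axis=1)

print(no_missing.shape, sw_present.shape, list(complete_columns.columns))
```

Restricting the check with `subset=` can keep more rows when only some columns matter for the analysis.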

In [13]:
# Set rows 2-4 of columns sw, pl and pw to NaN
df.iloc[2:5,1:4]=np.nan

In [14]:
df.head(6)

Unnamed: 0,sl,sw,pl,pw,flower_type
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,5.4,,,,Iris-setosa
3,4.6,,,,Iris-setosa
4,5.0,,,,Iris-setosa
5,4.4,2.9,1.4,0.2,Iris-setosa


**Deleting rows is not always an option, because it also discards the valid values in those rows**

**2. Filling the Entries**

**We can fill them with:**

***1. the mean() of the column***

***2. the most frequently occurring value in the column***

In [21]:
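# Fill column by column, each with its own mean.
# Assigning the result back avoids fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and which may not update df.
df["sw"]=df["sw"].fillna(df["sw"].mean())
df["pl"]=df["pl"].fillna(df["pl"].mean())
df["pw"]=df["pw"].fillna(df["pw"].mean())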
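The column-wise fill above can also be written as a single call, since `fillna()` accepts a Series of per-column fill values. A minimal sketch on a hypothetical `demo` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two numeric columns with gaps and one string column.
demo = pd.DataFrame({
    "sw": [3.0, np.nan, 3.4],
    "pl": [1.4, 1.6, np.nan],
    "flower_type": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# numeric_only=True skips the string column when computing the means.
column_means = demo.mean(numeric_only=True)

# Each column is filled with the value stored under its own name.
filled = demo.fillna(column_means)
print(filled)
```

Columns that have no entry in `column_means`, such as `flower_type` here, are left unchanged.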

In [22]:
df.head(6)

Unnamed: 0,sl,sw,pl,pw,flower_type
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,5.4,3.036111,3.853472,1.238194,Iris-setosa
3,4.6,3.036111,3.853472,1.238194,Iris-setosa
4,5.0,3.036111,3.853472,1.238194,Iris-setosa
5,4.4,2.9,1.4,0.2,Iris-setosa
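The second option listed above, filling with the most frequently occurring value, is not shown in the notebook. One possible sketch uses `Series.mode()`; the `demo` frame and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one numeric and one string column, each with a gap.
demo = pd.DataFrame({
    "pw": [0.2, 0.2, np.nan, 0.4, 0.2],
    "flower_type": ["Iris-setosa", None, "Iris-setosa", "Iris-virginica", "Iris-setosa"],
})

# mode() returns a Series because several values can tie; take the first one.
most_common_pw = demo["pw"].mode().iloc[0]
demo["pw"] = demo["pw"].fillna(most_common_pw)

# The same approach works for string columns, where a mean is not defined.
most_common_type = demo["flower_type"].mode().iloc[0]
demo["flower_type"] = demo["flower_type"].fillna(most_common_type)
print(demo)
```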


**Columns that hold string data need special handling**

***Calculations are much easier when a column holds numeric data***

***In most cases, string data is converted to a numeric type***
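For a column with several distinct labels, such as `flower_type`, one way to do this conversion is `pd.factorize()`, which assigns an integer code to each distinct value in order of first appearance. This is only a sketch on a hypothetical `demo` frame; the notebook itself uses `apply()` for the conversion.

```python
import pandas as pd

# Hypothetical frame with a string label column.
demo = pd.DataFrame({
    "flower_type": ["Iris-setosa", "Iris-versicolor", "Iris-setosa", "Iris-virginica"],
})

# codes holds one integer per row; labels holds the distinct values,
# so labels[code] recovers the original string.
codes, labels = pd.factorize(demo["flower_type"])
demo["flower_code"] = codes
print(demo)
print(list(labels))
```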

In [30]:
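# Add a gender column set to "Female", then mark the first 10 rows as "Male"
# (column position 5 is the new gender column)
df["gender"]="Female"
df.iloc[0:10,5]="Male"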

In [32]:
print(df.head(10))
df.tail(10)

    sl        sw        pl        pw  flower_type gender
0  4.9  3.000000  1.400000  0.200000  Iris-setosa   Male
1  4.7  3.200000  1.300000  0.200000  Iris-setosa   Male
2  5.4  3.036111  3.853472  1.238194  Iris-setosa   Male
3  4.6  3.036111  3.853472  1.238194  Iris-setosa   Male
4  5.0  3.036111  3.853472  1.238194  Iris-setosa   Male
5  4.4  2.900000  1.400000  0.200000  Iris-setosa   Male
6  4.9  3.100000  1.500000  0.100000  Iris-setosa   Male
7  5.4  3.700000  1.500000  0.200000  Iris-setosa   Male
8  4.8  3.400000  1.600000  0.200000  Iris-setosa   Male
9  4.8  3.000000  1.400000  0.100000  Iris-setosa   Male


Unnamed: 0,sl,sw,pl,pw,flower_type,gender
137,6.7,3.1,5.6,2.4,Iris-virginica,Female
138,6.9,3.1,5.1,2.3,Iris-virginica,Female
139,5.8,2.7,5.1,1.9,Iris-virginica,Female
140,6.8,3.2,5.9,2.3,Iris-virginica,Female
141,6.7,3.3,5.7,2.5,Iris-virginica,Female
142,6.7,3.0,5.2,2.3,Iris-virginica,Female
143,6.3,2.5,5.0,1.9,Iris-virginica,Female
144,6.5,3.0,5.2,2.0,Iris-virginica,Female
145,6.2,3.4,5.4,2.3,Iris-virginica,Female
146,5.9,3.0,5.1,1.8,Iris-virginica,Female


**df.column.apply(function_name)**

***apply() takes a function as a parameter, calls it on each value of the column, and returns a new Series of the results, which can then be assigned to a new column***

In [33]:
# Encode gender as a number: Male=0, anything else (here Female)=1
def f(g):
    if g == "Male":
        return 0
    else:
        return 1
df["sex"] = df.gender.apply(f)
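When the conversion is a fixed lookup like the one above, `Series.map()` with a dictionary is a common alternative to writing a function for `apply()`. A minimal sketch, assuming a hypothetical `demo` frame and a `codes` dictionary that mirrors the Male=0, Female=1 encoding:

```python
import pandas as pd

# Hypothetical frame with the same two labels as the gender column.
demo = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# Lookup table for the encoding used in this notebook.
codes = {"Male": 0, "Female": 1}

# map() replaces each value with its dictionary entry.
demo["sex"] = demo["gender"].map(codes)
print(demo)
```

One difference from the `apply()` version: a value that is missing from the dictionary becomes NaN with `map()`, while the function `f` would return 1 for it.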

In [36]:
del df["gender"]
df.head(11)

Unnamed: 0,sl,sw,pl,pw,flower_type,sex
0,4.9,3.0,1.4,0.2,Iris-setosa,0
1,4.7,3.2,1.3,0.2,Iris-setosa,0
2,5.4,3.036111,3.853472,1.238194,Iris-setosa,0
3,4.6,3.036111,3.853472,1.238194,Iris-setosa,0
4,5.0,3.036111,3.853472,1.238194,Iris-setosa,0
5,4.4,2.9,1.4,0.2,Iris-setosa,0
6,4.9,3.1,1.5,0.1,Iris-setosa,0
7,5.4,3.7,1.5,0.2,Iris-setosa,0
8,4.8,3.4,1.6,0.2,Iris-setosa,0
9,4.8,3.0,1.4,0.1,Iris-setosa,0
