## Handle Missing Data

Missing values can distort the results of an analysis, so they should be detected and handled before you draw conclusions from the data.

In [None]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame({'A':[1,2,np.nan,4,5],
                  'B':[5,3,np.nan,6,7],
                  'C':[1,2,3,3,2],
                  'D':[np.nan,6,8,np.nan,6],
                  'E':[4,5,6,7,8],
                'F':['abc','fgh','dh','pq',np.nan]
                  })

In [None]:
df

#### Check for Null Values

In [None]:
df.isnull()

In [None]:
df.isnull().sum()

### Remove Rows

One way to deal with empty cells is to remove rows that contain empty cells.

In [None]:
df.dropna()

In [None]:
# axis=1 drops the columns (not the rows) that contain missing values
df.dropna(axis=1)

In [None]:
df.dropna(thresh=6)

`thresh=6` means that only the rows that have at least 6 non-missing values are kept. The other ones are dropped.

Our data frame has 6 columns, so here every row with at least one missing value is dropped. A lower value such as `thresh=5` would drop only the rows that have 2 or more missing values.
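To make the effect of `thresh` concrete, the sketch below rebuilds the same frame, counts the non-missing values per row, and lists which row labels survive for a few thresholds. The names `non_missing` and `kept_by_thresh` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as defined at the top of this section.
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [5, 3, np.nan, 6, 7],
                   'C': [1, 2, 3, 3, 2],
                   'D': [np.nan, 6, 8, np.nan, 6],
                   'E': [4, 5, 6, 7, 8],
                   'F': ['abc', 'fgh', 'dh', 'pq', np.nan]})

# Number of non-missing values in each row.
non_missing = df.notna().sum(axis=1)

# dropna(thresh=k) keeps exactly the rows where non_missing >= k.
kept_by_thresh = {k: list(df.dropna(thresh=k).index) for k in (4, 5, 6)}

print(non_missing.tolist())
print(kept_by_thresh)
```

Row 2 has only 4 non-missing values, so it is the first row to disappear as the threshold rises.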

In [None]:
df

#### Drop based on a particular subset of columns

In [None]:
df.dropna(subset=['D'])

#### Fill with a constant value

Choose a constant value to be used as a replacement for the missing values.

If you pass one constant value to the ``fillna`` function, it replaces all the missing values in the data frame with that value.

In [None]:
df.fillna(value='FILL VALUE')

In [None]:
df.fillna(5)

In [None]:
df

#### Replace Only For Specified Columns

In [None]:
df["D"].fillna(5)
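When several columns need different replacement values, `fillna` also accepts a dictionary that maps column names to fill values. The sketch below uses the same frame; the fill values themselves (`0`, `5`, `'missing'`) are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [5, 3, np.nan, 6, 7],
                   'C': [1, 2, 3, 3, 2],
                   'D': [np.nan, 6, 8, np.nan, 6],
                   'E': [4, 5, 6, 7, 8],
                   'F': ['abc', 'fgh', 'dh', 'pq', np.nan]})

# One fill value per column; columns not listed (such as 'B') are left untouched.
df_filled = df.fillna({'A': 0, 'D': 5, 'F': 'missing'})

print(df_filled)
# fillna returns a new frame, so the original still has its gaps.
print(df['D'].isna().sum())
```

Because the result is a new object, assign it back (for example `df = df.fillna(...)`) if the change should persist.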

#### Fill with an aggregated value

Another option is to use an aggregated value such as mean, median, or mode.

In [None]:
df["D"].fillna(df["D"].mean())

In [None]:
df["D"].fillna(df["D"].mode()[0])

In [None]:
x = df["D"].median()

df["D"].fillna(x)
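The three aggregates can give different fill values for the same column. A small sketch on a standalone copy of column `D` shows how they compare; the `fills` dictionary is just a convenience for this example.

```python
import numpy as np
import pandas as pd

# Standalone copy of column D from the frame above.
d = pd.Series([np.nan, 6, 8, np.nan, 6], name='D')

# Aggregates skip missing values by default.
fills = {
    'mean': d.mean(),
    'median': d.median(),
    'mode': d.mode()[0],   # mode() returns a Series; take the first value
}

# Fill with the median and keep the result in a new Series.
d_median_filled = d.fillna(fills['median'])

print(fills)
print(d_median_filled.tolist())
```

The mean is pulled upward by the single value 8, while the median and mode both stay at 6, which is one reason the median is often preferred for skewed columns.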

#### Replace with the previous or next value

Replace the missing values in a column with the previous or next value in that column.

This method works well with time-series data. Suppose you have a data frame that contains daily temperature measurements and the temperature for one day is missing. A reasonable solution is to use the temperature of the next or previous day.
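The temperature scenario can be sketched with a small daily series. The readings and dates below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with two missing days.
temps = pd.Series(
    [21.5, 22.0, np.nan, 23.5, np.nan, 20.0],
    index=pd.date_range('2024-01-01', periods=6, freq='D'),
    name='temp_c',
)

# Forward fill: reuse the previous day's reading.
forward = temps.ffill()

# Backward fill: reuse the next day's reading.
backward = temps.bfill()

print(pd.DataFrame({'raw': temps, 'ffill': forward, 'bfill': backward}))
```

A gap at the very start of a series cannot be forward filled, and a gap at the very end cannot be backward filled, because there is no neighbouring value in that direction.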

``dataframe.bfill()`` is used to backward fill the missing values in the dataset.

In [None]:
# Backward fill (fillna(method='bfill') is deprecated in recent pandas versions)

df.bfill()

In [None]:
# Fill across the rows
df.bfill(axis=1)

In [None]:
df


In [None]:
# Forward fill

df.ffill(axis=0)

In [None]:
df

In [None]:
df.ffill(axis=0, inplace = True)

In [None]:
df

### Info on Unique Values

In [None]:
df['D'].unique()

In [None]:
df['D'].nunique()

In [None]:
df['C'].value_counts()

#### Applying Functions

In [None]:
def sq(x):
    # Return the square of x
    return x**2

In [None]:
df

In [None]:
df['A'].apply(sq)

In [None]:
df['F'].apply(len)

In [None]:
df['A'].sum()

#### Permanently Removing a Column

In [None]:
del df['D']

In [None]:
df

### Data Input and Output

#### Input

In [None]:
csv = pd.read_csv(r'..\Datasets\example')
csv

#### Reading a file located outside the project folder

In [None]:
csv = pd.read_csv(r'C:\Users\Dell\example_outside')
csv

#### Output

In [None]:
csv.to_csv('sample',index=False)
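The `index=False` argument in the call above keeps the row labels out of the file. The sketch below shows the difference using in-memory text instead of real files, so no paths are assumed; the tiny `frame` is made up for illustration.

```python
import io

import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# With no path argument, to_csv returns the CSV text as a string.
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

# Reading the version that includes the index produces an extra, unnamed column.
reread_with_index = pd.read_csv(io.StringIO(with_index))
reread_without_index = pd.read_csv(io.StringIO(without_index))

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```

The extra column shows up as `Unnamed: 0`, which is a common sign that a file was written with its index and read back without `index_col=0`.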

### Excel
pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images. Workbooks that contain images or macros may cause the `read_excel` method to fail.

In [None]:
m = pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

m

#### Output

In [None]:
m.to_excel('Sample_file.xlsx')

### Questions

Read the CSV file `example` from the folder and perform pandas operations on it.

In [1]:
import numpy as np
import pandas as pd

### Get Information of the dataset

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       4 non-null      float64
 1   B       4 non-null      float64
 2   C       5 non-null      int64  
 3   D       3 non-null      float64
 4   E       5 non-null      int64  
 5   F       4 non-null      object 
dtypes: float64(3), int64(2), object(1)
memory usage: 368.0+ bytes


### Handle Wrong Data

In [4]:
data = pd.read_csv(r'..\Datasets\data.csv')
data

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Fit
0,60,'2020/12/01',110,130,409.1,no
1,60,'2020/12/02',117,145,479.0,yes
2,60,'2020/12/03',103,135,340.0,no
3,45,'2020/12/04',109,175,282.4,yes
4,45,'2020/12/05',117,148,406.0,no
5,60,'2020/12/06',102,127,300.0,no
6,60,'2020/12/07',110,136,374.0,no
7,450,'2020/12/08',104,134,253.3,yes
8,30,'2020/12/09',109,133,195.1,yes
9,60,'2020/12/10',98,124,269.0,no


In [5]:
data.isnull().sum()

Duration    0
Date        1
Pulse       0
Maxpulse    0
Calories    2
Fit         0
dtype: int64

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  34 non-null     int64  
 1   Date      33 non-null     object 
 2   Pulse     34 non-null     int64  
 3   Maxpulse  34 non-null     int64  
 4   Calories  32 non-null     float64
 5   Fit       34 non-null     object 
dtypes: float64(1), int64(3), object(2)
memory usage: 1.7+ KB


In [7]:
data["Date"]

0     '2020/12/01'
1     '2020/12/02'
2     '2020/12/03'
3     '2020/12/04'
4     '2020/12/05'
5     '2020/12/06'
6     '2020/12/07'
7     '2020/12/08'
8     '2020/12/09'
9     '2020/12/10'
10    '2020/12/11'
11    '2020/12/12'
12    '2020/12/12'
13    '2020/12/13'
14    '2020/12/14'
15    '2020/12/15'
16    '2020/12/16'
17    '2020/12/17'
18    '2020/12/18'
19    '2020/12/19'
20    '2020/12/20'
21    '2020/12/21'
22             NaN
23      23-12-2020
24    '2020/12/24'
25    '2020/12/25'
26        20201226
27    '2020/12/27'
28    '2020/12/28'
29    '2020/12/29'
30    '2020/12/30'
31    '2020/12/31'
32    '2020/12/31'
33    '2020/12/31'
Name: Date, dtype: object

In [8]:
# pandas has a to_datetime() function for converting the dates to one consistent format

data["Date"] = pd.to_datetime(data["Date"])
data["Date"]


0    2020-12-01
1    2020-12-02
2    2020-12-03
3    2020-12-04
4    2020-12-05
5    2020-12-06
6    2020-12-07
7    2020-12-08
8    2020-12-09
9    2020-12-10
10   2020-12-11
11   2020-12-12
12   2020-12-12
13   2020-12-13
14   2020-12-14
15   2020-12-15
16   2020-12-16
17   2020-12-17
18   2020-12-18
19   2020-12-19
20   2020-12-20
21   2020-12-21
22          NaT
23   2020-12-23
24   2020-12-24
25   2020-12-25
26   2020-12-26
27   2020-12-27
28   2020-12-28
29   2020-12-29
30   2020-12-30
31   2020-12-31
32   2020-12-31
33   2020-12-31
Name: Date, dtype: datetime64[ns]
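How `to_datetime` treats a column with several date layouts depends on the pandas version, and some versions may warn or raise on mixed formats. One version-independent alternative, sketched below under the assumption that the possible layouts are known in advance, is to try each explicit format with `errors="coerce"` and keep the first successful parse. The `raw` series and the `formats` list are illustrative and use unquoted date strings.

```python
import pandas as pd

# Hypothetical column mixing three layouts plus one missing value.
raw = pd.Series(["2020/12/01", None, "23-12-2020", "20201226"])

# Candidate layouts, tried in order.
formats = ["%Y/%m/%d", "%d-%m-%Y", "%Y%m%d"]

# Start with all-missing timestamps and fill in whatever each format can parse.
parsed = pd.Series(pd.NaT, index=raw.index, dtype="datetime64[ns]")
for fmt in formats:
    attempt = pd.to_datetime(raw, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

print(parsed)
```

Values that match none of the listed formats stay as `NaT`, so they can be inspected or dropped afterwards.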

In [9]:
data['Date'].dropna()

0    2020-12-01
1    2020-12-02
2    2020-12-03
3    2020-12-04
4    2020-12-05
5    2020-12-06
6    2020-12-07
7    2020-12-08
8    2020-12-09
9    2020-12-10
10   2020-12-11
11   2020-12-12
12   2020-12-12
13   2020-12-13
14   2020-12-14
15   2020-12-15
16   2020-12-16
17   2020-12-17
18   2020-12-18
19   2020-12-19
20   2020-12-20
21   2020-12-21
23   2020-12-23
24   2020-12-24
25   2020-12-25
26   2020-12-26
27   2020-12-27
28   2020-12-28
29   2020-12-29
30   2020-12-30
31   2020-12-31
32   2020-12-31
33   2020-12-31
Name: Date, dtype: datetime64[ns]

In [10]:
data.columns

Index(['Duration', 'Date', 'Pulse', 'Maxpulse', 'Calories', 'Fit'], dtype='object')

In [11]:
data["Duration"]

0      60
1      60
2      60
3      45
4      45
5      60
6      60
7     450
8      30
9      60
10     60
11     60
12     60
13     60
14     60
15     60
16     89
17     60
18     45
19     60
20     45
21     60
22     45
23     60
24     45
25     61
26     60
27     60
28     60
29     80
30     60
31     60
32     60
33     60
Name: Duration, dtype: int64

In [12]:
# Replace Value

data.loc[7,"Duration"] = 45
data["Duration"]

0     60
1     60
2     60
3     45
4     45
5     60
6     60
7     45
8     30
9     60
10    60
11    60
12    60
13    60
14    60
15    60
16    89
17    60
18    45
19    60
20    45
21    60
22    45
23    60
24    45
25    61
26    60
27    60
28    60
29    80
30    60
31    60
32    60
33    60
Name: Duration, dtype: int64

To replace wrong data in larger data sets, you can create rules, e.g. set boundaries for legal values and replace any values that fall outside those boundaries.

In [13]:
for x in data.index:
    if data.loc[x,"Duration"] > 60:
        data.loc[x,"Duration"] = 60
        
data["Duration"]

0     60
1     60
2     60
3     45
4     45
5     60
6     60
7     45
8     30
9     60
10    60
11    60
12    60
13    60
14    60
15    60
16    60
17    60
18    45
19    60
20    45
21    60
22    45
23    60
24    45
25    60
26    60
27    60
28    60
29    60
30    60
31    60
32    60
33    60
Name: Duration, dtype: int64
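The row-by-row loop above can also be written as a single vectorized operation, which is usually shorter and faster on large frames. The sketch below uses a small made-up `duration` series and shows two equivalent ways to apply an upper boundary of 60.

```python
import pandas as pd

# Hypothetical durations, including values above the allowed maximum of 60.
duration = pd.Series([60, 45, 450, 89, 30, 61], name='Duration')

# Option 1: clip caps every value at the given upper bound.
capped = duration.clip(upper=60)

# Option 2: a Boolean mask with .loc, closest in spirit to the loop.
masked = duration.copy()
masked.loc[masked > 60] = 60

print(capped.tolist())
print(masked.tolist())
```

Both variants leave values at or below the boundary unchanged.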

#### Removing Duplicates

Duplicate rows are rows that have been registered more than one time.

To discover duplicates, we can use the duplicated() method.

The duplicated() method returns a Boolean value for each row: True for every row that repeats an earlier row, otherwise False.

To remove duplicates, use the drop_duplicates() method.

In [14]:
data

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Fit
0,60,2020-12-01,110,130,409.1,no
1,60,2020-12-02,117,145,479.0,yes
2,60,2020-12-03,103,135,340.0,no
3,45,2020-12-04,109,175,282.4,yes
4,45,2020-12-05,117,148,406.0,no
5,60,2020-12-06,102,127,300.0,no
6,60,2020-12-07,110,136,374.0,no
7,45,2020-12-08,104,134,253.3,yes
8,30,2020-12-09,109,133,195.1,yes
9,60,2020-12-10,98,124,269.0,no


In [15]:
data.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32     True
33     True
dtype: bool

In [16]:
data.drop_duplicates()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Fit
0,60,2020-12-01,110,130,409.1,no
1,60,2020-12-02,117,145,479.0,yes
2,60,2020-12-03,103,135,340.0,no
3,45,2020-12-04,109,175,282.4,yes
4,45,2020-12-05,117,148,406.0,no
5,60,2020-12-06,102,127,300.0,no
6,60,2020-12-07,110,136,374.0,no
7,45,2020-12-08,104,134,253.3,yes
8,30,2020-12-09,109,133,195.1,yes
9,60,2020-12-10,98,124,269.0,no


In [17]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Duration  34 non-null     int64         
 1   Date      33 non-null     datetime64[ns]
 2   Pulse     34 non-null     int64         
 3   Maxpulse  34 non-null     int64         
 4   Calories  32 non-null     float64       
 5   Fit       34 non-null     object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(1)
memory usage: 1.7+ KB
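The `info()` output above still reports 34 entries because `drop_duplicates()` returns a new frame and leaves the original untouched unless the result is assigned back. A minimal sketch with a made-up four-row frame:

```python
import pandas as pd

# Hypothetical frame where rows 2 and 3 repeat row 0.
frame = pd.DataFrame({'Duration': [60, 45, 60, 60],
                      'Pulse': [110, 109, 110, 110]})

# True marks every repeat of an earlier row.
dup_mask = frame.duplicated()

# The result must be stored; the original frame keeps all of its rows.
deduped = frame.drop_duplicates().reset_index(drop=True)

print(dup_mask.tolist())
print(len(frame), len(deduped))
```

`reset_index(drop=True)` is optional; it only renumbers the remaining rows from 0.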


### Pandas - Data Correlations

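Finding Relationships

The corr() method calculates the correlation between every pair of numeric columns in your data set.

Older pandas versions silently ignore non-numeric columns. Since pandas 2.0 you need to pass numeric_only=True (or select the numeric columns yourself) when the frame contains non-numeric columns such as "Fit".

The result of the corr() method is a table of coefficients that describe how strongly two columns are linearly related.

Each coefficient lies between -1 and 1.

1 means that there is a 1 to 1 relationship (a perfect positive correlation): each time a value goes up in the first column, the other one goes up as well. -1 means a perfect negative correlation, and values close to 0 mean little or no linear relationship.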


In [18]:
data.corr(numeric_only=True)

Unnamed: 0,Duration,Pulse,Maxpulse,Calories
Duration,1.0,-0.10715,-0.281099,0.293904
Pulse,-0.10715,1.0,0.336311,0.549325
Maxpulse,-0.281099,0.336311,1.0,0.396167
Calories,0.293904,0.549325,0.396167,1.0
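To make the two ends of the -1 to 1 scale concrete, the sketch below builds a tiny made-up frame in which one column is an exact multiple of another and a third runs in the opposite direction. `select_dtypes('number')` is used so the example does not depend on how a given pandas version treats the text column.

```python
import pandas as pd

# Hypothetical data: 'doubled' rises with 'minutes', 'reversed' falls as it rises.
toy = pd.DataFrame({
    'minutes': [10, 20, 30, 40],
    'doubled': [20, 40, 60, 80],
    'reversed': [40, 30, 20, 10],
    'label': ['a', 'b', 'c', 'd'],   # non-numeric, excluded below
})

# Keep only numeric columns, then compute pairwise correlations.
toy_corr = toy.select_dtypes('number').corr()

print(toy_corr.round(3))
```

The `minutes`/`doubled` entry comes out at 1 and the `minutes`/`reversed` entry at -1, while real measurements such as the workout data fall somewhere in between.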


**Perfect Correlation**

We can see that "Duration" and "Duration" got the number 1.000000, which makes sense: each column always has a perfect relationship with itself.

**Good Correlation**

The strongest relationship between two different columns in this table is "Pulse" and "Calories" at 0.549325, a moderate positive correlation. "Duration" and "Calories" got a 0.293904 correlation, which is only a weak positive correlation: longer workouts tend to burn somewhat more calories in this data set, but duration alone is not a reliable predictor.

**Bad Correlation**

"Duration" and "Maxpulse" got a -0.281099 correlation, which is a weak negative correlation, meaning that we cannot reliably predict the max pulse by just looking at the duration of the workout, and vice versa.

### isnull() Conditional Selector

To select the rows that contain NaN entries in a column, you can use pd.isnull() as a Boolean mask.

In [19]:
data[pd.isnull(data['Calories'])]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Fit
18,45,2020-12-18,90,112,,no
28,60,2020-12-28,103,132,,yes


In [20]:
data['Fit']

0        no
1       yes
2        no
3       yes
4        no
5        no
6        no
7       yes
8       yes
9        no
10       no
11      yes
12    noyes
13       no
14       no
15      yes
16      yes
17       no
18       no
19      yes
20       no
21       no
22      yes
23       ni
24       no
25       no
26      yes
27      yes
28      yes
29      yes
30       no
31      yes
32      yes
33      yes
Name: Fit, dtype: object

In [21]:
data['Fit'].replace(['noyes'], 'no')

0      no
1     yes
2      no
3     yes
4      no
5      no
6      no
7     yes
8     yes
9      no
10     no
11    yes
12     no
13     no
14     no
15    yes
16    yes
17     no
18     no
19    yes
20     no
21     no
22    yes
23     ni
24     no
25     no
26    yes
27    yes
28    yes
29    yes
30     no
31    yes
32    yes
33    yes
Name: Fit, dtype: object

In [22]:
data


Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Fit
0,60,2020-12-01,110,130,409.1,no
1,60,2020-12-02,117,145,479.0,yes
2,60,2020-12-03,103,135,340.0,no
3,45,2020-12-04,109,175,282.4,yes
4,45,2020-12-05,117,148,406.0,no
5,60,2020-12-06,102,127,300.0,no
6,60,2020-12-07,110,136,374.0,no
7,45,2020-12-08,104,134,253.3,yes
8,30,2020-12-09,109,133,195.1,yes
9,60,2020-12-10,98,124,269.0,no


In [23]:
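# Assign the result back; inplace=True on a selected column may not update the frame in recent pandas versions
data['Fit'] = data['Fit'].replace(['noyes'], 'no')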

In [24]:
data

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Fit
0,60,2020-12-01,110,130,409.1,no
1,60,2020-12-02,117,145,479.0,yes
2,60,2020-12-03,103,135,340.0,no
3,45,2020-12-04,109,175,282.4,yes
4,45,2020-12-05,117,148,406.0,no
5,60,2020-12-06,102,127,300.0,no
6,60,2020-12-07,110,136,374.0,no
7,45,2020-12-08,104,134,253.3,yes
8,30,2020-12-09,109,133,195.1,yes
9,60,2020-12-10,98,124,269.0,no


In [27]:
data["Fit_encoded"]=data['Fit'].map(dict(yes=1, no=0))

In [28]:
data

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Fit,Fit_encoded
0,60,2020-12-01,110,130,409.1,no,0.0
1,60,2020-12-02,117,145,479.0,yes,1.0
2,60,2020-12-03,103,135,340.0,no,0.0
3,45,2020-12-04,109,175,282.4,yes,1.0
4,45,2020-12-05,117,148,406.0,no,0.0
5,60,2020-12-06,102,127,300.0,no,0.0
6,60,2020-12-07,110,136,374.0,no,0.0
7,45,2020-12-08,104,134,253.3,yes,1.0
8,30,2020-12-09,109,133,195.1,yes,1.0
9,60,2020-12-10,98,124,269.0,no,0.0
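The encoded column appears as `0.0`/`1.0` rather than integers, which usually indicates that some value was not in the mapping: `map` turns every unmapped value into NaN, and NaN forces a float column. The sketch below reproduces this on a short made-up series and shows one way to list the offending values. Treating `'ni'` as a typo for `'no'` is an assumption made only for this example.

```python
import pandas as pd

# Hypothetical slice of the Fit column, including two malformed entries.
fit = pd.Series(['no', 'yes', 'noyes', 'ni', 'yes'], name='Fit')

mapping = {'yes': 1, 'no': 0}
encoded = fit.map(mapping)

# Values that the mapping did not cover end up as NaN.
unmapped = fit[encoded.isna()].unique()

# Assumed corrections (not confirmed by the data source), then encode again.
cleaned = fit.replace({'noyes': 'no', 'ni': 'no'})
clean_encoded = cleaned.map(mapping)

print(list(unmapped))
print(clean_encoded.tolist())
```

Checking `unmapped` before encoding makes it harder for typos to silently turn into missing values.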
