# Pandas - Cleaning Data of Wrong Format

- Cells that hold data in the wrong format can make it difficult, or even impossible, to analyze the data.

- To fix this, you have two options: remove the affected rows, or convert all cells in the column to the same format.
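
Before choosing between the two options, it can help to locate the cells that are in the wrong format. The sketch below is only an illustration: it uses a small made-up DataFrame (not the tutorial's `data.csv`) and assumes `%Y/%m/%d` is the expected date layout. Cells that hold a value but do not match that layout are flagged separately from cells that are simply empty.

```python
import pandas as pd

# Hypothetical sample data; the real data.csv is not reproduced here.
df = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Date": ["2020/12/01", "20201226", None],
})

# Parse strictly against the assumed expected layout.
# errors="coerce" turns anything that does not match into NaT.
parsed = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")

# A cell is in the wrong format if it holds a value that failed to parse.
wrong_format = parsed.isna() & df["Date"].notna()

# A cell is simply missing if it was empty to begin with.
missing = df["Date"].isna()

print(df[wrong_format])
print(df[missing])
```

Keeping the two masks apart makes it easier to decide which rows are worth converting and which can only be removed.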


### Convert Into a Correct Format

In our DataFrame, two cells have the wrong format. They are highlighted in red in the image below.


![Diagram](markdownImage/excel.png)

---

Let's try to convert all cells in the 'Date' column into dates.

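Pandas has a `to_datetime()` function for this. The `format="mixed"` argument, available in pandas 2.0 and later, tells it to work out the format of each value individually: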


In [5]:
# Convert to date:

import pandas as pd

df = pd.read_csv("data.csv")

df["Date"] = pd.to_datetime(df["Date"], format="mixed")

print(df.to_string())

    Duration       Date  Pulse  Maxpulse  Calories
0         60 2020-12-01    110       130     409.1
1         60 2020-12-02    117       145     479.0
2         60 2020-12-03    103       135     340.0
3         45 2020-12-04    109       175     282.4
4         45 2020-12-05    117       148     406.0
5         60 2020-12-06    102       127     300.0
6         60 2020-12-07    110       136     374.0
7        450 2020-12-08    104       134     253.3
8         30 2020-12-09    109       133     195.1
9         60 2020-12-10     98       124     269.0
10        60 2020-12-11    103       147     329.3
11        60 2020-12-12    100       120     250.7
12        60 2020-12-12    100       120     250.7
13        60 2020-12-13    106       128     345.3
14        60 2020-12-14    104       132     379.3
15        60 2020-12-15     98       123     275.0
16        60 2020-12-16     98       120     215.2
17        60 2020-12-17    100       120     300.0
18        45 2020-12-18     90 
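
On pandas versions older than 2.0, `format="mixed"` is not available. One possible alternative, sketched below under the assumption that the column only contains the two layouts `%Y/%m/%d` and `%Y%m%d`, is to parse the column once per layout and combine the results. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values in two different layouts, plus one empty cell.
raw = pd.Series(["2020/12/01", "20201226", None])

# First pass: the layout most cells are assumed to use.
primary = pd.to_datetime(raw, format="%Y/%m/%d", errors="coerce")

# Second pass: the compact layout. Its results are used only
# where the first pass produced NaT.
fallback = pd.to_datetime(raw, format="%Y%m%d", errors="coerce")

dates = primary.fillna(fallback)
print(dates)
```

Because both passes use `errors="coerce"`, a value that matches neither layout ends up as `NaT` instead of raising an error, so it can be inspected or dropped afterwards.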

### Removing Rows

The conversion in the example above produced a NaT (Not a Time) value, which pandas treats as a missing value. We can remove the affected row with the `dropna()` method.


In [6]:
df.dropna(subset=["Date"], inplace=True)

In [7]:
print(df.to_string())

    Duration       Date  Pulse  Maxpulse  Calories
0         60 2020-12-01    110       130     409.1
1         60 2020-12-02    117       145     479.0
2         60 2020-12-03    103       135     340.0
3         45 2020-12-04    109       175     282.4
4         45 2020-12-05    117       148     406.0
5         60 2020-12-06    102       127     300.0
6         60 2020-12-07    110       136     374.0
7        450 2020-12-08    104       134     253.3
8         30 2020-12-09    109       133     195.1
9         60 2020-12-10     98       124     269.0
10        60 2020-12-11    103       147     329.3
11        60 2020-12-12    100       120     250.7
12        60 2020-12-12    100       120     250.7
13        60 2020-12-13    106       128     345.3
14        60 2020-12-14    104       132     379.3
15        60 2020-12-15     98       123     275.0
16        60 2020-12-16     98       120     215.2
17        60 2020-12-17    100       120     300.0
18        45 2020-12-18     90 
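
The cell above modifies `df` directly through `inplace=True`. As a minimal sketch on a made-up DataFrame, the same step can also be written so that the original stays untouched, followed by an optional renumbering of the index:

```python
import pandas as pd

# Hypothetical sample data with one missing date (stored as NaT).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-01", None, "2020-12-03"]),
    "Pulse": [110, 117, 103],
})

# Without inplace=True, dropna() returns a new DataFrame
# and leaves df unchanged.
cleaned = df.dropna(subset=["Date"])

# Dropping rows leaves a gap in the index (0, 2); renumber it if needed.
cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

The `subset` argument limits the check to the listed columns, so rows with missing values elsewhere are kept.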

## Adding Columns, Calling Functions, and Changing Data Types

- Change a column's data type with `.astype()`
- Call a function on each value with `.apply()`

In [10]:
df["New_Column"] = df["Pulse"].astype(int).apply(lambda x: x * 2)
df.head()

   Duration       Date  Pulse  Maxpulse  Calories  New_Column
0        60 2020-12-01    110       130     409.1         220
1        60 2020-12-02    117       145     479.0         234
2        60 2020-12-03    103       135     340.0         206
3        45 2020-12-04    109       175     282.4         218
4        45 2020-12-05    117       148     406.0         234
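
To make the two steps easier to see in isolation, here is a small sketch on made-up data. The column names `Doubled_Apply` and `Doubled_Vectorized` are hypothetical and chosen only for this illustration; `Pulse` is stored as text so that the effect of `.astype(int)` is visible.

```python
import pandas as pd

# Hypothetical sample data; Pulse is stored as text to make the cast visible.
df = pd.DataFrame({"Pulse": ["110", "117", "103"]})

# Convert the text values to integers.
pulse = df["Pulse"].astype(int)

# Double each value by calling a function on every element.
df["Doubled_Apply"] = pulse.apply(lambda x: x * 2)

# Vectorized alternative: arithmetic on the whole column at once.
df["Doubled_Vectorized"] = pulse * 2

print(df)
```

For simple arithmetic like this, the vectorized form gives the same result and is usually faster; `.apply()` is more useful when the function cannot be expressed as column arithmetic.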


## Grouping Data

In [15]:
grouped_df = df.groupby(["Duration"])['Pulse'].sum()
print(grouped_df)

Duration
30      109
45      518
60     2481
450     104
Name: Pulse, dtype: int64


In [16]:
grouped_df = df.groupby(["Duration","Maxpulse"])['Pulse'].sum()
print(grouped_df)

Duration  Maxpulse
30        133         109
45        112          90
          125          97
          132         105
          148         117
          175         109
60        101         130
          115          92
          118          92
          120         498
          123         201
          124          98
          126         102
          127         102
          128         106
          129         102
          130         110
          131         108
          132         307
          135         103
          136         110
          145         117
          147         103
450       134         104
Name: Pulse, dtype: int64
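
Grouping by two columns produces a Series with a two-level index, as the output above shows. The following sketch, which uses a small made-up DataFrame, shows one way to look up a single group and to turn the index levels back into ordinary columns:

```python
import pandas as pd

# Hypothetical sample data, smaller than the tutorial's DataFrame.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 60],
    "Maxpulse": [130, 130, 175, 145],
    "Pulse": [110, 100, 109, 117],
})

# Grouping by two columns yields a Series with a two-level index.
grouped = df.groupby(["Duration", "Maxpulse"])["Pulse"].sum()

# One group can be looked up with a (Duration, Maxpulse) tuple.
print(grouped.loc[(60, 130)])

# reset_index() turns the index levels back into ordinary columns.
flat = grouped.reset_index()
print(flat)
```

The flattened form is often more convenient when the result needs to be exported or merged with other data.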


In [17]:
# Apply several aggregation functions at once
grouped_agg = df.groupby(["Duration","Maxpulse"])['Pulse'].agg(['sum', 'mean', 'max'])
print(grouped_agg)

                   sum        mean  max
Duration Maxpulse                      
30       133       109  109.000000  109
45       112        90   90.000000   90
         125        97   97.000000   97
         132       105  105.000000  105
         148       117  117.000000  117
         175       109  109.000000  109
60       101       130  130.000000  130
         115        92   92.000000   92
         118        92   92.000000   92
         120       498   99.600000  100
         123       201  100.500000  103
         124        98   98.000000   98
         126       102  102.000000  102
         127       102  102.000000  102
         128       106  106.000000  106
         129       102  102.000000  102
         130       110  110.000000  110
         131       108  108.000000  108
         132       307  102.333333  104
         135       103  103.000000  103
         136       110  110.000000  110
         145       117  117.000000  117
         147       103  103.000000  103
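
When a list of function names is passed to `.agg()`, the result columns are named after the functions. If more descriptive names are wanted, pandas also accepts keyword arguments (named aggregation). The sketch below uses made-up data, and the names `total`, `average`, and `peak` are hypothetical choices for this example:

```python
import pandas as pd

# Hypothetical sample data, smaller than the tutorial's DataFrame.
df = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Pulse": [110, 100, 109],
})

# Each keyword becomes a column name; each value names the function to apply.
grouped_agg = df.groupby("Duration")["Pulse"].agg(
    total="sum",
    average="mean",
    peak="max",
)
print(grouped_agg)
```

The computed values are the same as with the list form; only the column labels differ.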
