# LECTURE 1: Python Pandas Introduction.

### What is Pandas?
**Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.**
**The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".**
### Why use Pandas?
**Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.**
*Pandas can clean messy data sets and make them readable and relevant. Relevant data is very important in data science.*
### What can Pandas do?
**Pandas gives you answers about the data, such as:**

 * Is there a correlation between two or more columns?
 * What is the average value?
 * Max value?
 * Min value?
   
**Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or null values. This is called cleaning the data.**
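The questions listed above each map onto a short call. The sketch below is only an illustration: the table, its column names, and its values are made up for this example and do not come from the lecture's data files.

```python
import pandas as pd

# Hypothetical workout data, invented for illustration
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [150.0, 225.0, 300.0, 450.0],
})

# Correlation between two columns (Pearson correlation by default)
corr = df["Duration"].corr(df["Calories"])

# Average, maximum, and minimum of a column
avg_cal = df["Calories"].mean()
max_cal = df["Calories"].max()
min_cal = df["Calories"].min()

print(corr, avg_cal, max_cal, min_cal)
```

Because Calories is an exact multiple of Duration in this toy table, the correlation comes out at (approximately) 1.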
### Where is the Pandas codebase?
**The source code of Pandas is located in this GitHub repository:**
**https://github.com/pandas-dev/pandas**

In [1]:
import pandas as pd
print(pd.__version__)  # Displaying Pandas version



2.1.4


# LECTURE 2: Python Pandas Series.
A Pandas Series is a one-dimensional labeled array-like object that can hold data of any type. It can be thought of as a column in a spreadsheet or a single column of a DataFrame. The row labels of a Series are called the index. A Series cannot contain multiple columns.
Here is an example of creating a Pandas Series from a list:

In [2]:
# import pandas
import pandas as pd

# a simple list (named `letters` to avoid shadowing the built-in `list`)
letters = ['g', 'e', 'e', 'k', 's']

# create series from a list
ser = pd.Series(letters)

print(ser)

0    g
1    e
2    e
3    k
4    s
dtype: object


In [3]:
import pandas as pd
print(pd.__version__)

my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ser = pd.Series(my_list)
print(ser)


2.1.4
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64


### Labeling 
In a Pandas Series, labeling refers to the process of assigning identifiers to the data points. These identifiers, also known as labels, are used to access specific data points within the series.
By default, the labels in a Pandas Series are index numbers starting from 0, as in arrays and DataFrames. These labels can be used to access a specified value.
Here is an example of accessing elements using labels in a Pandas Series:

In [4]:
# Labeling can be used to access specified values.
# import pandas and numpy
import pandas as pd
import numpy as np

# creating simple array
data = np.array(['g','e','e','k','s','f', 'o','r','g','e','e','k','s'])

# creating series with custom index labels
ser = pd.Series(data, index=[10,11,12,13,14,15,16,17,18,19,20,21,22])

# accessing an element using its index label
print(ser[16])

o


In [5]:
# Without custom labels, values are accessed through the default integer labels (0, 1, 2, ...).
# import pandas
import pandas as pd

# create a list
my_list = [1, 7, 2]

# create a Pandas Series
ser = pd.Series(my_list)

# print the value at the first index (indexing starts from 0)
print(ser[0])


1


In [6]:
# With custom labels, you can create your own named labels and use a label to access a specified value.
# import pandas
import pandas as pd

# create a list
my_list = [1, 7, 2]

# create a Pandas Series with custom labels
ser = pd.Series(my_list, index=["x", "y", "z"])

# print the value associated with the 'x' label
print(ser['x'])


1
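With custom labels, a value can be reached either by its label or by its position. The short sketch below uses the explicit `.loc` (label) and `.iloc` (position) accessors, which remove any ambiguity when the labels are themselves integers. It reuses the values from the cell above.

```python
import pandas as pd

ser = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = ser.loc["y"]      # label-based lookup
by_position = ser.iloc[1]    # position-based lookup (second element)

print(by_label, by_position)
```

Both lookups return the same value here because "y" is the label of the second element.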


In [7]:
# You can also use a key-value object, like a dictionary, when creating a series. 
# Here, we will create a simple Pandas series from a dictionary.

# import pandas
import pandas as pd

# create a dictionary
my_dict = {"Day1": 420, "Day2": 380, "Day3": 290}

# create a Pandas Series from the dictionary
ser = pd.Series(my_dict)

# print the resulting Series
print(ser)


Day1    420
Day2    380
Day3    290
dtype: int64


In [8]:
# Now, we will create a series using only the data for "Day1" and "Day2".
# import pandas
import pandas as pd

# create a dictionary
my_dict = {"Day1": 420, "Day2": 380, "Day3": 290}

# create a Pandas Series with a specified index
ser = pd.Series(my_dict, index=["Day1", "Day2"])

# print the resulting Series
print(ser)


Day1    420
Day2    380
dtype: int64
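The `index` argument selects keys from the dictionary. If a requested label is not a key, pandas does not raise an error; it creates the entry with a missing value instead. A minimal sketch, where "Day4" is a hypothetical label added only to show this:

```python
import pandas as pd

my_dict = {"Day1": 420, "Day2": 380, "Day3": 290}

# "Day4" is not a key in the dictionary (hypothetical label for illustration)
ser = pd.Series(my_dict, index=["Day1", "Day4"])

print(ser)
print(ser.isna().tolist())
```

The missing entry is stored as NaN, which also turns the Series into a floating-point Series.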


In [9]:
# DataFrame: Data sets in Pandas are usually multi-dimensional tables, which are called DataFrames.
# Series are like columns, and a DataFrame is the whole table.

# import pandas
import pandas as pd

# create a dictionary with lists as values
my_dict = {"cal": [420, 380, 390], "duration": [50, 40, 45]}

# create a Pandas DataFrame from the dictionary
df = pd.DataFrame(my_dict)

# print the resulting DataFrame
print(df)


   cal  duration
0  420        50
1  380        40
2  390        45
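To make the "Series are like columns" idea concrete, the sketch below selects one column from the same DataFrame and inspects what comes back. The variable name `cal` is chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({"cal": [420, 380, 390], "duration": [50, 40, 45]})

# Selecting one column by name returns a Series
cal = df["cal"]

print(type(cal).__name__)           # Series
print(cal.name)                     # cal
print(cal.index.equals(df.index))   # True: the column keeps the DataFrame's row labels
```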


# LECTURE 3: Python Pandas Dataframes

In [10]:
# DataFrame: It is a 2D data structure, similar to a 2D array, which includes rows and columns.
# import pandas
import pandas as pd

# create a dictionary with lists as values
data = {"cal": [420, 380, 390], "duration": [50, 40, 45]}

# create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

# print the resulting DataFrame
print(df)



   cal  duration
0  420        50
1  380        40
2  390        45


In [11]:
# Locate Row: Pandas uses the loc attribute to return one or more specified rows.

In [12]:
# import pandas
import pandas as pd

# create a dictionary with lists as values
data = {"cal": [420, 380, 390], "duration": [50, 40, 45]}

# create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

# use loc to access the first row of the DataFrame
print(df.loc[0])



cal         420
duration     50
Name: 0, dtype: int64


In [13]:
# import pandas
import pandas as pd

# create a dictionary with lists as values
data = {"cal": [123, 430, 789], "dur": [70, 65, 89]}

# create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

# use loc to access specific rows (0 and 1) of the DataFrame
print(df.loc[[0, 1]])


   cal  dur
0  123   70
1  430   65


In [14]:
# Named Index: With the ‘index’ argument, you can name your own index.

# import pandas
import pandas as pd

# create a dictionary with lists as values
data = {"cal": [123, 430, 789], "dur": [70, 65, 89]}

# create a Pandas DataFrame from the dictionary with custom index
df = pd.DataFrame(data, index=["Day1", "Day2", "Day3"])

# use loc to access the row with index "Day2" of the DataFrame
print(df.loc["Day2"])


cal    430
dur     65
Name: Day2, dtype: int64


In [15]:
# import pandas
import pandas as pd

# create a dictionary with lists as values
data = {"cal": [123, 430, 789], "dur": [70, 65, 89]}

# create a Pandas DataFrame from the dictionary with custom index
df = pd.DataFrame(data, index=["Day1", "Day2", "Day3"])

# use loc to access specific rows ("Day2" and "Day3") of the DataFrame
print(df.loc[["Day2", "Day3"]])


      cal  dur
Day2  430   65
Day3  789   89


In [16]:
# import pandas
import pandas as pd

# create a dictionary with lists as values
data = {"cal": [123, 430, 789], "dur": [70, 65, 89]}

# create a Pandas DataFrame from the dictionary with custom index
df = pd.DataFrame(data, index=["Day1", "Day2", "Day3"])

# pass a list to loc to get the "Day2" row back as a DataFrame instead of a Series
print(df.loc[["Day2"]])


      cal  dur
Day2  430   65
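The difference between the last few cells is the type of the result: a single label gives a Series, while a list of labels (even a list with one entry) gives a DataFrame. A small sketch using the same data; the variable names are chosen for this example only.

```python
import pandas as pd

data = {"cal": [123, 430, 789], "dur": [70, 65, 89]}
df = pd.DataFrame(data, index=["Day1", "Day2", "Day3"])

row_as_series = df.loc["Day2"]     # single label -> Series
row_as_frame = df.loc[["Day2"]]    # list of labels -> DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
print(row_as_frame.shape)
```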


In [17]:
# Load the data from the CSV file into a DataFrame, i.e., data.csv.

# import pandas
import pandas as pd

# read the CSV file into a DataFrame
df = pd.read_csv('data.csv')

# print the resulting DataFrame
print(df)


     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]


# LECTURE 4: Python Pandas Read CSV


In [18]:
# import pandas
import pandas as pd

# read the CSV file into a DataFrame
df = pd.read_csv('data.csv')

# print the resulting DataFrame
print(df)


     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]
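To try `read_csv` without the `data.csv` file at hand, the same call can be pointed at in-memory text. The sketch below is an assumption-laden stand-in: the CSV text is a hypothetical three-row excerpt with the same column names, and the last row deliberately leaves Calories empty.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
print(df.dtypes)
print(df["Calories"].isna().sum())
```

The empty field is read as NaN, which is why a column with missing values ends up as a floating-point column.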


In [19]:
# print the entire DataFrame as a string
print(df.to_string())


     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

In [20]:
# Print the current value of the maximum number of rows to display when printing a DataFrame
print(pd.options.display.max_rows)


60


In [21]:
# Set the maximum number of rows to display when printing a DataFrame to 9999
pd.options.display.max_rows = 9999


In [22]:
print(df)

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   
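Setting `pd.options.display.max_rows` changes the display for the rest of the session. When the change is only needed for one printout, `pd.option_context` can apply it temporarily. This is a sketch on a hypothetical 100-row frame, not on the lecture's data:

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})  # hypothetical 100-row frame

before = pd.get_option("display.max_rows")

# Temporarily allow up to 200 rows; the previous setting is restored on exit
with pd.option_context("display.max_rows", 200):
    inside = pd.get_option("display.max_rows")
    full_text = repr(df)

after = pd.get_option("display.max_rows")

print(before, inside, after)
print(len(full_text.splitlines()))
```

Inside the `with` block the frame is rendered without truncation; afterwards the option is back to whatever it was before.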

# LECTURE 5: Pandas Read JSON 
**Large data sets are often stored or exchanged as JSON. JSON is plain text, but it is structured like an object.**

In [23]:
# Dictionary as JSON: If your JSON data is not in a file
# but in a Python dictionary, you can load it into a DataFrame directly.

# import pandas
import pandas as pd

# create a dictionary representing a dataset
data = {
    "Duration": {
        "0": 60,
        "1": 60,
        "2": 60,
        "3": 45,
        "4": 45,
        "5": 60
    },
    "Pulse": {
        "0": 110,
        "1": 117,
        "2": 103,
        "3": 109,
        "4": 117,
        "5": 102
    },
    "Maxpulse": {
        "0": 130,
        "1": 145,
        "2": 135,
        "3": 175,
        "4": 148,
        "5": 127
    },
    "Calories": {
        "0": 409.1,
        "1": 479.0,
        "2": 340.0,
        "3": 282.4,
        "4": 406.0,
        "5": 300.5
    }
}

# convert the dictionary to a Pandas DataFrame
df = pd.DataFrame(data)

# print the resulting DataFrame
print(df)



   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
5        60    102       127     300.5
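When the JSON arrives as text rather than as a file or a dictionary, `read_json` can still parse it. In the sketch below the JSON string is a hypothetical two-row excerpt in the same column-oriented layout as the dictionary above. It is wrapped in `io.StringIO` because passing a raw JSON string directly is deprecated in recent pandas versions.

```python
import io
import pandas as pd

# Hypothetical JSON text: {column -> {row label -> value}}
json_text = '{"Duration": {"0": 60, "1": 45}, "Pulse": {"0": 110, "1": 109}}'

df = pd.read_json(io.StringIO(json_text))

print(df)
```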


In [24]:
# import pandas
import pandas as pd

# read the JSON file into a DataFrame
df = pd.read_json('data.js')

# print the resulting DataFrame
print(df)


     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.5
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112       NaN
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

# LECTURE 6: Pandas Viewing and Analyzing DataFrames
**Viewing the DataFrame: One of the most used methods for a quick overview of the DataFrame is the head() method. It returns the column headers and a specified number of rows, starting from the top (five by default).**

In [25]:
# import pandas
import pandas as pd

# read the CSV file into a DataFrame
df = pd.read_csv("data.csv")

# print the first two rows of the DataFrame
print(df.head(2))


   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0


In [26]:
# print the first five rows of the DataFrame
df.head()

   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0


In [27]:
# print the last five rows of the DataFrame
df.tail()

     Duration  Pulse  Maxpulse  Calories
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4


In [28]:
# Display concise summary information about the DataFrame
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB


In [29]:
# Display the full summary (verbose=True forces per-column details even for wide DataFrames)
df.info(verbose=True)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
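In the summary above, Calories reports 164 non-null values out of 169 entries, so five cells are empty. The same counts can be computed directly, as sketched below on a small hypothetical frame (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing Calories values
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60],
    "Calories": [409.1, np.nan, 195.1, np.nan],
})

missing_per_column = df.isna().sum()      # empty cells per column
non_null_per_column = df.notna().sum()    # corresponds to info()'s "Non-Null Count"

print(missing_per_column)
print(non_null_per_column)
```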


# LECTURE 7: Python Pandas - Cleaning Data
## Cleaning Data: Fixing Bad Data in Your Datasets

Bad data could include:
- Empty cells
- Data in the wrong format
- Duplicate data
- Incorrect data

## Empty Cells: Addressing the Issue
Empty cells can lead to inaccurate results. One way to mitigate this is to remove the rows that contain empty cells, which produces a new DataFrame with no empty cells.
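A minimal, self-contained sketch of removing rows with empty cells, using a hypothetical three-row frame instead of the lecture's CSV file:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one empty cell
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, np.nan, 195.1],
})

# dropna() returns a new DataFrame; df itself is left unchanged
cleaned = df.dropna()

print(len(df), len(cleaned))
print(cleaned.index.tolist())
```

The surviving rows keep their original labels, so the index of the cleaned frame has a gap where the row was dropped.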


In [30]:
# Import pandas library
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("bad_data.csv")


In [31]:
# Create a copy of the DataFrame with no empty cells (NaN values)
df_copy = df.dropna()

# Display the new DataFrame without empty cells
print(df_copy)


    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
7        450  '2020/12/08'    104       134     253.3
8         30  '2020/12/09'    109       133     195.1
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'  

In [32]:
# If you want to change the original DataFrame, use the inplace=True argument
# This will remove the rows containing null (NaN) values
df.dropna(inplace=True)

# Display the DataFrame after removing rows containing null (NaN) values
print(df.to_string())


    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
7        450  '2020/12/08'    104       134     253.3
8         30  '2020/12/09'    109       133     195.1
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'  

In [33]:

# Replacing the empty values: We will use the fillna() method to replace empty cells with a specified value

# Import pandas library
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("bad_data.csv")

# Replace empty cells with the value 130
df.fillna(130, inplace=True)

# Display the DataFrame after replacing empty values
print(df)


    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
7        450  '2020/12/08'    104       134     253.3
8         30  '2020/12/09'    109       133     195.1
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'  

In [34]:
# Import pandas library
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("bad_data.csv")

# Replace empty cells in the "Calories" column with the value 130
# (assigning the result back avoids relying on an in-place change to a column selection)
df["Calories"] = df["Calories"].fillna(130)

# Display the DataFrame after replacing empty values
print(df)


    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
7        450  '2020/12/08'    104       134     253.3
8         30  '2020/12/09'    109       133     195.1
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'  

In [35]:
# Import pandas library
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("bad_data.csv")

# Create a mask to identify cells with empty values in the "Calories" column
mask_before = df["Calories"].isnull()

# Replace empty cells in the "Calories" column with the value 130
df["Calories"] = df["Calories"].fillna(130)

# Display the DataFrame after replacing empty values
print(df)

# Display the indices where the replacement value (130) was applied.
# The mask taken before the fill is used so that rows which already held 130 are not counted.
indices_replaced = df.index[mask_before]
print("Indices where empty cells were replaced with 130:")
print(indices_replaced)


    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
7        450  '2020/12/08'    104       134     253.3
8         30  '2020/12/09'    109       133     195.1
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'  

## Additionally, we have the option to replace empty cells using mean(), median(), or mode().
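The three statistics can give different replacement values for the same column. The sketch below compares them on a hypothetical Series (values invented for illustration) before the lecture's cells apply each one to `bad_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one empty cell
calories = pd.Series([100.0, 200.0, 200.0, 500.0, np.nan])

mean_value = calories.mean()       # 250.0 (NaN is skipped)
median_value = calories.median()   # 200.0
mode_value = calories.mode()[0]    # 200.0; mode() returns a Series, so take its first entry

filled_with_mean = calories.fillna(mean_value)
filled_with_median = calories.fillna(median_value)

print(mean_value, median_value, mode_value)
print(filled_with_mean.tolist())
```

The mean is pulled upward by the single large value, while the median and mode are not, which is one reason to look at the data before picking a replacement.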


In [36]:
# Import pandas library
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("bad_data.csv")

# Calculate the mean of the "Calories" column
df_mean = df["Calories"].mean()

# Replace empty cells in the "Calories" column with the mean value
df["Calories"] = df["Calories"].fillna(df_mean)

# Display the DataFrame after replacing empty values with the mean
print(df)


    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130    409.10
1         60  '2020/12/02'    117       145    479.00
2         60  '2020/12/03'    103       135    340.00
3         45  '2020/12/04'    109       175    282.40
4         45  '2020/12/05'    117       148    406.00
5         60  '2020/12/06'    102       127    300.00
6         60  '2020/12/07'    110       136    374.00
7        450  '2020/12/08'    104       134    253.30
8         30  '2020/12/09'    109       133    195.10
9         60  '2020/12/10'     98       124    269.00
10        60  '2020/12/11'    103       147    329.30
11        60  '2020/12/12'    100       120    250.70
12        60  '2020/12/12'    100       120    250.70
13        60  '2020/12/13'    106       128    345.30
14        60  '2020/12/14'    104       132    379.30
15        60  '2020/12/15'     98       123    275.00
16        60  '2020/12/16'     98       120    215.20
17        60  '2020/12/17'  

In [37]:
# Import pandas library
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("bad_data.csv")

# Calculate the median of the "Calories" column
df_median = df["Calories"].median()

# Replace empty cells in the "Calories" column with the median value
df["Calories"] = df["Calories"].fillna(df_median)

# Display the DataFrame after replacing empty values with the median
print(df)


    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
7        450  '2020/12/08'    104       134     253.3
8         30  '2020/12/09'    109       133     195.1
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'  

In [38]:
# Import pandas library
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("bad_data.csv")

# Calculate the mode of the "Calories" column
df_mode = df["Calories"].mode()

# Replace empty cells in the "Calories" column with the mode value
# (mode() returns a Series, so take its first entry)
df["Calories"] = df["Calories"].fillna(df_mode.iloc[0])

# Display the DataFrame after replacing empty values with the mode
print(df)


    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
7        450  '2020/12/08'    104       134     253.3
8         30  '2020/12/09'    109       133     195.1
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'  

### Another way to write the above

In [39]:
# Import pandas library
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("bad_data.csv")

# Calculate the mode of the "Calories" column
df_mode = df["Calories"].mode()[0]

# Replace empty cells in the "Calories" column with the mode value
df["Calories"] = df["Calories"].fillna(df_mode)

# Display the DataFrame after replacing empty values with the mode
print(df)


    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
7        450  '2020/12/08'    104       134     253.3
8         30  '2020/12/09'    109       133     195.1
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
14        60  '2020/12/14'    104       132     379.3
15        60  '2020/12/15'     98       123     275.0
16        60  '2020/12/16'     98       120     215.2
17        60  '2020/12/17'  

# LECTURE 8: Python Pandas Cleaning Data of Wrong Format

**Data in the wrong format can be addressed through two approaches: either by removing the problematic rows or by converting all cells to the same format.**

In [40]:
# Import the pandas library
import pandas as pd

# Read the CSV file into a pandas DataFrame
df = pd.read_csv('bad_data.csv')

# Convert the 'Date' column to datetime format, handling errors by coercing invalid values to NaT
df["Date"] = pd.to_datetime(df["Date"], errors='coerce')

# Drop rows with 'NaT' values in the 'Date' column
df = df.dropna(subset=['Date'])

# Alternative (left commented out): instead of dropping the rows, fill missing values
# in the 'Date' column with a default date (e.g., '1900-01-01'),
# then convert the 'Date' column to datetime format again
# df['Date'] = df['Date'].fillna('1900-01-01')
# df['Date'] = pd.to_datetime(df['Date'])

# Print the resulting DataFrame
print(df)


    Duration       Date  Pulse  Maxpulse  Calories
0         60 2020-12-01    110       130     409.1
1         60 2020-12-02    117       145     479.0
2         60 2020-12-03    103       135     340.0
3         45 2020-12-04    109       175     282.4
4         45 2020-12-05    117       148     406.0
5         60 2020-12-06    102       127     300.0
6         60 2020-12-07    110       136     374.0
7        450 2020-12-08    104       134     253.3
8         30 2020-12-09    109       133     195.1
9         60 2020-12-10     98       124     269.0
10        60 2020-12-11    103       147     329.3
11        60 2020-12-12    100       120     250.7
12        60 2020-12-12    100       120     250.7
13        60 2020-12-13    106       128     345.3
14        60 2020-12-14    104       132     379.3
15        60 2020-12-15     98       123     275.0
16        60 2020-12-16     98       120     215.2
17        60 2020-12-17    100       120     300.0
18        45 2020-12-18     90 

In [41]:
# Alternative ways to convert the "Date" column, kept commented out for reference:
# df["Date"] = pd.to_datetime(df["Date"])
# df["Date"] = pd.to_datetime(df["Date"], format='%Y%m%d')
# df["Date"] = pd.to_datetime(df["Date"], format="'%Y/%m/%d'")
# df["Date"] = pd.to_datetime(df["Date"], infer_datetime_format=True)  # deprecated since pandas 2.0
# df["Date"] = pd.to_datetime(df["Date"], format='%Y%m%d', errors='coerce')
# df["Date"] = pd.to_datetime(df["Date"].str.strip("'"), format='%Y/%m/%d', errors='coerce')
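The last alternative can be tried in isolation. The `Date` values in `bad_data.csv` carry literal quote characters, so the sketch below builds a hypothetical Series in that style (plus one value in a different layout and one missing value), strips the quotes, and parses with an explicit format.

```python
import pandas as pd

# Hypothetical raw values: quoted dates, one in a different layout, one missing
raw = pd.Series(["'2020/12/01'", "'2020/12/02'", "20201226", None])

# Strip the surrounding quotes, then parse; values that do not match the format become NaT
parsed = pd.to_datetime(raw.str.strip("'"), format="%Y/%m/%d", errors="coerce")

print(parsed)
print(parsed.isna().tolist())
```

With `errors="coerce"`, the value that does not follow `%Y/%m/%d` is turned into NaT rather than raising an error, so it can then be dropped or handled separately.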

In [42]:
# Filter rows whose date falls in December (the mask checks the month only, not the year)
december_data = df[df['Date'].dt.month == 12]

# Calculate the total duration for December
total_duration_december = december_data['Duration'].sum()

# Print the result
print(f"Total duration for December: {total_duration_december} minutes")


Total duration for December: 2085 minutes
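The filter above checks only the month. If the data spanned several years, the year would need to be part of the condition as well. A sketch on a hypothetical frame (dates and durations invented for illustration):

```python
import pandas as pd

# Hypothetical rows spanning two years
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-12-30", "2020-11-30", "2020-12-01", "2020-12-15"]),
    "Duration": [60, 45, 30, 60],
})

# Combine two boolean masks with & (each condition in its own parentheses)
is_dec_2020 = (df["Date"].dt.year == 2020) & (df["Date"].dt.month == 12)

total = df.loc[is_dec_2020, "Duration"].sum()
print(f"Total duration for December 2020: {total} minutes")
```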


In [43]:
import os

# Get the current working directory
directory_path = os.getcwd()

# Get the list of files in the current working directory
file_list = os.listdir(directory_path)

# Print the list of files
print(file_list)


['.ipynb_checkpoints', 'bad_data.csv', 'data.csv', 'data.js', 'data_new.csv', 'Pandas_Part_1.ipynb', 'Pandas_Part_2.ipynb']


In [44]:
# Import the os module for operating system-related functions
import os

# Get the list of files in the current working directory
file_list = os.listdir(os.getcwd())

# Print the list of files
print(file_list)


['.ipynb_checkpoints', 'bad_data.csv', 'data.csv', 'data.js', 'data_new.csv', 'Pandas_Part_1.ipynb', 'Pandas_Part_2.ipynb']


### End of Lecture 8.