### Create a new column from multiple columns in your dataframe
1. **df** is your dataframe
2. **row** will correspond to each row in your data frame
3. **func** is the function you want to apply to your data frame
4. **axis**=1 to apply the function to each row in your data frame

**apply** and **lambda** can help you easily apply any logic to your columns using the following format:

**df[new_col] = df.apply(lambda row: func(row), axis=1)**

In [None]:
!pip install pandas --upgrade

In [196]:
import pandas as pd

In [197]:
# Create the dataframe
candidates = {
    'Name':["Emily","Lucas","Camile","Gabriel","Steven"],
    'Degree':['Master','Master','Bachelor','PhD','Master'],
    'From':['Arizona','Chicago','San Franciso','New York','Ohio'],
    'Years_exp':[2,3,5,6,0],
    'From_office(min)':[120,80,65,100,30]
}

candidates_df = pd.DataFrame(candidates)

In [198]:
# Custom Function
# candidate_info() combines each candidate's information
# to create a single description column about that candidate
def candidate_info(row):
    # Select columns of interest
    name = row.Name
    is_from = row.From
    year_exp = row.Years_exp
    degree = row.Degree
    from_office = row["From_office(min)"]
    
    # Generate the description from previous variables
    info = f"""{name} from {is_from} holds a {degree} degree 
           with {year_exp} year(s) experience and lives 
           {from_office} from the office"""
    return info

In [199]:
# Application of the above function to the data 
candidates_df["Description"] = candidates_df.apply(lambda row: candidate_info(row), axis=1)

In [200]:
candidates_df

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description
0,Emily,Master,Arizona,2,120,Emily from Arizona holds a Master degree \n ...
1,Lucas,Master,Chicago,3,80,Lucas from Chicago holds a Master degree \n ...
2,Camile,Bachelor,San Franciso,5,65,Camile from San Franciso holds a Bachelor degr...
3,Gabriel,PhD,New York,6,100,Gabriel from New York holds a PhD degree \n ...
4,Steven,Master,Ohio,0,30,Steven from Ohio holds a Master degree \n ...


### Convert numerical data into categorical ones
<p>This process mainly occurs in the feature engineering phase. Some of its benefits are:</p>

1. The identification of outliers, invalid, and missing values in data.

2. Reduction of the chance of overfitting by creating more robust models.

**.cut**() to specifically define your bin edges

In [201]:
# Categorize candidates by expertise according to their
# years of experience:
## Entry level : 0-1 years
## Mid-level : 2-3 years
## Senior level : 4-6 years

seniority = ['Entry','Mid Level','Senior Level']
seniority_bins = [0,1,3,6]
candidates_df['Seniority'] = pd.cut(candidates_df['Years_exp'],
                                    bins=seniority_bins,
                                    labels=seniority,
                                    include_lowest=True)
candidates_df

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description,Seniority
0,Emily,Master,Arizona,2,120,Emily from Arizona holds a Master degree \n ...,Mid Level
1,Lucas,Master,Chicago,3,80,Lucas from Chicago holds a Master degree \n ...,Mid Level
2,Camile,Bachelor,San Franciso,5,65,Camile from San Franciso holds a Bachelor degr...,Senior Level
3,Gabriel,PhD,New York,6,100,Gabriel from New York holds a PhD degree \n ...,Senior Level
4,Steven,Master,Ohio,0,30,Steven from Ohio holds a Master degree \n ...,Entry


**qcut**() to divide your data into equal-sized bins

It uses the underlying percentiles of the distribution of the data, rather than the edges of the bins.

In [202]:
# Categorize the commute time of the candidates into :
## Good, Acceptable, or Too Long
commute_time_labels = ["good","acceptable","too long"]
candidates_df["Commute_level"] = pd.qcut(
                                        candidates_df["From_office(min)"],
                                        q =3,
                                        labels=commute_time_labels
                                        )
candidates_df

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description,Seniority,Commute_level
0,Emily,Master,Arizona,2,120,Emily from Arizona holds a Master degree \n ...,Mid Level,too long
1,Lucas,Master,Chicago,3,80,Lucas from Chicago holds a Master degree \n ...,Mid Level,acceptable
2,Camile,Bachelor,San Franciso,5,65,Camile from San Franciso holds a Bachelor degr...,Senior Level,good
3,Gabriel,PhD,New York,6,100,Gabriel from New York holds a PhD degree \n ...,Senior Level,too long
4,Steven,Master,Ohio,0,30,Steven from Ohio holds a Master degree \n ...,Entry,good


Notes:
1. When using **.cut**(): number of bin edges = number of labels + 1
2. When using **.qcut**(): number of quantiles (**q**) = number of labels
3. With **.cut**(): set **include_lowest=True** when the lowest value equals the first bin edge; otherwise it falls outside the first interval and is converted to NaN
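
The notes above can be checked on a small series. This is a minimal sketch: the `years` values and the `low`/`medium`/`high` labels are made up for illustration and are not part of the candidates data.

```python
import pandas as pd

years = pd.Series([0, 2, 3, 5, 6])           # illustrative values
edges = [0, 1, 3, 6]                         # 4 edges -> 3 bins
labels = ["Entry", "Mid Level", "Senior Level"]

# Default intervals are (0, 1], (1, 3], (3, 6]: the value 0 is not covered
without_lowest = pd.cut(years, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left
with_lowest = pd.cut(years, bins=edges, labels=labels, include_lowest=True)

# qcut takes a number of quantiles (q) instead of explicit edges
thirds = pd.qcut(years, q=3, labels=["low", "medium", "high"])

print(without_lowest.isna().tolist())
print(with_lowest.tolist())
```

Only the first element of `without_lowest` is missing, because 0 sits exactly on the first edge.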

### Select rows from a Pandas Dataframe based on column(s) values

1. use **.query**() function by specifying the filter condition
2. the filter expression can contain any operators(<,>,==,!=,etc.)
3. use the **@** sign to use a variable in the expression

In [203]:
# Get all the candidates with a Master degree
ms_candidates = candidates_df.query("Degree == 'Master'")

In [204]:
ms_candidates

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description,Seniority,Commute_level
0,Emily,Master,Arizona,2,120,Emily from Arizona holds a Master degree \n ...,Mid Level,too long
1,Lucas,Master,Chicago,3,80,Lucas from Chicago holds a Master degree \n ...,Mid Level,acceptable
4,Steven,Master,Ohio,0,30,Steven from Ohio holds a Master degree \n ...,Entry,good


In [205]:
# Get non-bachelor candidates
no_bs_candidates = candidates_df.query("Degree != 'Bachelor'")

In [206]:
no_bs_candidates

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description,Seniority,Commute_level
0,Emily,Master,Arizona,2,120,Emily from Arizona holds a Master degree \n ...,Mid Level,too long
1,Lucas,Master,Chicago,3,80,Lucas from Chicago holds a Master degree \n ...,Mid Level,acceptable
3,Gabriel,PhD,New York,6,100,Gabriel from New York holds a PhD degree \n ...,Senior Level,too long
4,Steven,Master,Ohio,0,30,Steven from Ohio holds a Master degree \n ...,Entry,good


In [207]:
# Get values from list
list_locations = ["Chicago", "Arizona"]
candidates = candidates_df.query("From in @list_locations")

In [208]:
candidates

Unnamed: 0,Name,Degree,From,Years_exp,From_office(min),Description,Seniority,Commute_level
0,Emily,Master,Arizona,2,120,Emily from Arizona holds a Master degree \n ...,Mid Level,too long
1,Lucas,Master,Chicago,3,80,Lucas from Chicago holds a Master degree \n ...,Mid Level,acceptable


### Deal with zip files

In [209]:
# Case 1 : Write Zip Files
## Read data from internet
df = pd.read_csv('https://raw.githubusercontent.com/scpike/us-state-county-zip/master/geo-data.csv')
df.head()

Unnamed: 0,state_fips,state,state_abbr,zipcode,county,city
0,1,Alabama,AL,35004,St. Clair,Acmar
1,1,Alabama,AL,35005,Jefferson,Adamsville
2,1,Alabama,AL,35006,Jefferson,Adger
3,1,Alabama,AL,35007,Shelby,Keystone
4,1,Alabama,AL,35010,Tallapoosa,New site


In [210]:
df.to_csv('geo-data.csv')

# Save it as a zip file
df.to_csv("geo-data.csv.zip", compression="zip")

# Check the files sizes
from os import path
path.getsize('geo-data.csv') / path.getsize('geo-data.csv.zip')

3.7266106274091118

In [211]:
# Case 2 : Read Zip Files
## To read a single zip file
df_unzip = pd.read_csv("geo-data.csv.zip", compression="zip")

In [None]:
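from zipfile import ZipFile
## To read one file from a zip archive that contains several files:
## open the archive first, then pass the selected member to read_csv
sales_df = pd.read_csv(ZipFile("data.zip").open('data/sales_df.csv'))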
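
Reading one member of a multi-file archive can be tried end to end without any external file. This is a sketch under assumptions: the archive is built on the fly in a temporary directory, and the `sales` data is made up.

```python
import os
import tempfile
from zipfile import ZipFile

import pandas as pd

sales = pd.DataFrame({"item": ["a", "b"], "amount": [10, 20]})  # made-up data

with tempfile.TemporaryDirectory() as tmp_dir:
    archive_path = os.path.join(tmp_dir, "data.zip")

    # Build an archive that holds two members
    with ZipFile(archive_path, "w") as archive:
        archive.writestr("data/sales_df.csv", sales.to_csv(index=False))
        archive.writestr("data/readme.txt", "not a CSV")

    # Open the archive, then hand the chosen member to read_csv
    with ZipFile(archive_path) as archive:
        with archive.open("data/sales_df.csv") as member:
            sales_from_zip = pd.read_csv(member)

print(sales_from_zip)
```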

### Select a subset of your Pandas Dataframe with specific column types
Use the **select_dtypes**() function. It takes two main parameters: **include** and **exclude**

1. **df.select_dtypes(include=['type_1','type_2',..])** means I want the subset of my data frame with columns of type_1, type_2...
2. **df.select_dtypes(exclude=['type_1', 'type_2',..])** means I want the subset of my data frame WITHOUT columns of type_1, type_2..

In [213]:
# Check the data columns' types
candidates_df.dtypes

Name                  object
Degree                object
From                  object
Years_exp              int64
From_office(min)       int64
Description           object
Seniority           category
Commute_level       category
dtype: object

In [214]:
# Only select columns of type "object" & "category"
candidates_df.select_dtypes(include=["object","category"])

Unnamed: 0,Name,Degree,From,Description,Seniority,Commute_level
0,Emily,Master,Arizona,Emily from Arizona holds a Master degree \n ...,Mid Level,too long
1,Lucas,Master,Chicago,Lucas from Chicago holds a Master degree \n ...,Mid Level,acceptable
2,Camile,Bachelor,San Franciso,Camile from San Franciso holds a Bachelor degr...,Senior Level,good
3,Gabriel,PhD,New York,Gabriel from New York holds a PhD degree \n ...,Senior Level,too long
4,Steven,Master,Ohio,Steven from Ohio holds a Master degree \n ...,Entry,good


In [215]:
# Exclude columns of type "int64" & "category"
candidates_df.select_dtypes(exclude=["int64", "category"])

Unnamed: 0,Name,Degree,From,Description
0,Emily,Master,Arizona,Emily from Arizona holds a Master degree \n ...
1,Lucas,Master,Chicago,Lucas from Chicago holds a Master degree \n ...
2,Camile,Bachelor,San Franciso,Camile from San Franciso holds a Bachelor degr...
3,Gabriel,PhD,New York,Gabriel from New York holds a PhD degree \n ...
4,Steven,Master,Ohio,Steven from Ohio holds a Master degree \n ...
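
Besides concrete type names, `select_dtypes` also accepts the generic selector `"number"`, which covers every numeric column. A minimal sketch on a stand-in data frame (the `frame` values are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({
    "Name": ["Emily", "Lucas"],
    "Years_exp": [2, 3],
    "Height": [6.5, 5.2],
})

# "number" matches both the integer and the float column
numeric_only = frame.select_dtypes(include="number")
non_numeric = frame.select_dtypes(exclude="number")

print(list(numeric_only.columns))
print(list(non_numeric.columns))
```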


### Remove comments from Pandas Dataframe column
This can be done on the fly while loading your pandas dataframe using the **comment** parameter as follows :

clean_data = pd.read_csv(path_to_data, comment="symbol")

But what if I want to keep those comments in a new column while still removing them from the application date column?

In [216]:
temp_df = {
    'Name':["Emily","Lucas","Camile","Gabriel","Steven"],
    'Degree':['Master','Master','Bachelor','PhD','Master'],
    'From':['Arizona','Chicago','San Franciso','New York','Ohio'],
    'From_office(min)':[120,80,65,100,30],
    'application_date':[
                    '17/11/2022 # more interested in Machine Learning',
                    '23/09/2022 #open to any type of Data Role',
                    '02/12/2021 # will be available in 6 months',
                    '25/08/2022 # only interested in Senior positions',
                    '07/01/2022 # can relocate to any other cities'
    ]
}

In [217]:
df = pd.DataFrame(temp_df)

In [218]:
df.to_csv('temp_df.csv')

In [219]:
# Messy Data
messy_df = pd.read_csv("temp_df.csv")

In [220]:
messy_df

Unnamed: 0.1,Unnamed: 0,Name,Degree,From,From_office(min),application_date
0,0,Emily,Master,Arizona,120,17/11/2022 # more interested in Machine Learning
1,1,Lucas,Master,Chicago,80,23/09/2022 #open to any type of Data Role
2,2,Camile,Bachelor,San Franciso,65,02/12/2021 # will be available in 6 months
3,3,Gabriel,PhD,New York,100,25/08/2022 # only interested in Senior positions
4,4,Steven,Master,Ohio,30,07/01/2022 # can relocate to any other cities


In [221]:
# Scenario 1 : Remove Comments
## clean_data = pd.read_csv(path_to_data, comment="symbol")
clean_df = pd.read_csv("temp_df.csv", comment="#")
clean_df

Unnamed: 0.1,Unnamed: 0,Name,Degree,From,From_office(min),application_date
0,0,Emily,Master,Arizona,120,17/11/2022
1,1,Lucas,Master,Chicago,80,23/09/2022
2,2,Camile,Bachelor,San Franciso,65,02/12/2021
3,3,Gabriel,PhD,New York,100,25/08/2022
4,4,Steven,Master,Ohio,30,07/01/2022


In [222]:
# Scenario 2 : Create new column for comments
# n must be passed by keyword: recent pandas versions reject it as a positional argument
messy_df[['application_date', 'comment']] = messy_df['application_date'].str.split('#', n=1, expand=True)
messy_df

Unnamed: 0.1,Unnamed: 0,Name,Degree,From,From_office(min),application_date,comment
0,0,Emily,Master,Arizona,120,17/11/2022,more interested in Machine Learning
1,1,Lucas,Master,Chicago,80,23/09/2022,open to any type of Data Role
2,2,Camile,Bachelor,San Franciso,65,02/12/2021,will be available in 6 months
3,3,Gabriel,PhD,New York,100,25/08/2022,only interested in Senior positions
4,4,Steven,Master,Ohio,30,07/01/2022,can relocate to any other cities


### Print Pandas Dataframe in Tabular format from console
1. Applying the **print**() function to a pandas data frame does not always render an output that is easy to read, especially for data frames with multiple columns.
2. Use the **.to_string**() function

In [223]:
data_URL = "https://raw.githubusercontent.com/scpike/us-state-county-zip/master/geo-data.csv"

# Read your data frame
geo_df = pd.read_csv(data_URL)

# Printing without to_string() function
print(geo_df.head())

   state_fips    state state_abbr zipcode      county        city
0           1  Alabama         AL   35004   St. Clair       Acmar
1           1  Alabama         AL   35005   Jefferson  Adamsville
2           1  Alabama         AL   35006   Jefferson       Adger
3           1  Alabama         AL   35007      Shelby    Keystone
4           1  Alabama         AL   35010  Tallapoosa    New site


In [224]:
# Printing with to_string() function
print(geo_df.head().to_string())

   state_fips    state state_abbr zipcode      county        city
0           1  Alabama         AL   35004   St. Clair       Acmar
1           1  Alabama         AL   35005   Jefferson  Adamsville
2           1  Alabama         AL   35006   Jefferson       Adger
3           1  Alabama         AL   35007      Shelby    Keystone
4           1  Alabama         AL   35010  Tallapoosa    New site


### Highlight data points in Pandas
To emphasize certain data points for quick analysis, use the **DataFrame.style** accessor, which returns a Styler object with many features:
1. df.style.highlight_max() to assign a color to the max value of each column
2. df.style.highlight_min() to assign a color to the min value of each column
3. df.style.apply(my_custom_function) to apply your custom function to your data frame

In [225]:
my_info = {
     "Salary": [100000.2, 95000.9, 103000.2, 65984.1, 150987.08], 
    "Height": [6.5, 5.2, 5.59, 6.7, 6.92], 
    "weight": [185.23, 105.12, 110.3, 190.12, 200.59]     
}
my_data =pd.DataFrame(my_info)

In [226]:
# Function to highlight min and max
def highlight_min_max(data_frame, min_color, max_color):
    # This first line creates a style object
    final_data = data_frame.style.highlight_max(color = max_color)
    
    # On this second line, no need to use .style
    final_data = final_data.highlight_min(color = min_color)
    return final_data

In [None]:
highlight_min_max(my_data, min_color='orange', max_color='green')

In [None]:
# Custom function: apply RED or GREEN depending on whether a value is
# below the mean of its column or not (style.apply works column-wise by default).
def highlight_values(data_column):
    low_value_color = "background-color: #C4606B; color: white;"
    high_value_color = "background-color: #C4DE6B; color: white;"
    # Avoid naming this variable "filter", which would shadow the builtin
    is_low = data_column < data_column.mean()

    return [low_value_color if low_value else high_value_color for low_value in is_low]

In [None]:
# Application of my custom function to only 'Height' & 'weight'
my_data.style.apply(highlight_values, subset=['Height', 'weight'])

### Reduce decimal points in your data
Use the pandas **DataFrame.round**() function

In [228]:
long_decimals_info = {
    "Salary": [100000.23400000, 95000.900300, 103000.2300535, 65984.14000450, 150987.080345], 
    "Height": [6.501050, 5.270000, 5.5900001050, 6.730001050, 6.92100050], 
    "weight": [185.23000059, 105.1200099, 110.350003, 190.12000000, 200.59000000]      
}

In [229]:
long_decimals_df = pd.DataFrame(long_decimals_info)

In [230]:
fewer_decimals_df = long_decimals_df.round(decimals=2)
fewer_decimals_df

Unnamed: 0,Salary,Height,weight
0,100000.23,6.5,185.23
1,95000.9,5.27,105.12
2,103000.23,5.59,110.35
3,65984.14,6.73,190.12
4,150987.08,6.92,200.59
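
`round` also accepts a dictionary when each column needs its own precision. The sketch below uses a two-row stand-in data frame; the chosen precisions are arbitrary.

```python
import pandas as pd

measures = pd.DataFrame({
    "Salary": [100000.234, 95000.9003],
    "Height": [6.50105, 5.27],
})

# Keys are column names, values are the number of decimals to keep
rounded = measures.round({"Salary": 0, "Height": 1})
print(rounded)
```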


### Replace some values in your data frame
Use the pandas **DataFrame.replace**() function

In [231]:
import numpy as np

In [232]:
client_info = {
    'Name':["Emily","Lucas","Camile","Gabriel","Steven"],
    'Degree':['Masters','Master','Bachelor','PhD','Master'],
    'From':['Arizona','Chicago','San Franciso','New York','Ohio'],
    'Age':[23,26,19,np.nan,25]
}

client_df = pd.DataFrame(client_info)

In [233]:
client_df

Unnamed: 0,Name,Degree,From,Age
0,Emily,Masters,Arizona,23.0
1,Lucas,Master,Chicago,26.0
2,Camile,Bachelor,San Franciso,19.0
3,Gabriel,PhD,New York,
4,Steven,Master,Ohio,25.0


In [234]:
# Replace Masters, Master by MS
degrees_to_replace = ["Master", "Masters"]
client_df.replace(to_replace = degrees_to_replace, value = "MS",
                  inplace = True)
client_df

Unnamed: 0,Name,Degree,From,Age
0,Emily,MS,Arizona,23.0
1,Lucas,MS,Chicago,26.0
2,Camile,Bachelor,San Franciso,19.0
3,Gabriel,PhD,New York,
4,Steven,MS,Ohio,25.0


In [235]:
# Replace all the NaN by "Missing"
client_df.replace(to_replace = np.nan, value = "Missing",
                  inplace=True)

In [236]:
client_df

Unnamed: 0,Name,Degree,From,Age
0,Emily,MS,Arizona,23.0
1,Lucas,MS,Chicago,26.0
2,Camile,Bachelor,San Franciso,19.0
3,Gabriel,PhD,New York,Missing
4,Steven,MS,Ohio,25.0
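
To limit a replacement to one column, `replace` can take a nested dictionary of the form `{column: {old: new}}`. This is a sketch on a stand-in data frame; unlike the cells above, it returns a new data frame instead of using `inplace=True`.

```python
import pandas as pd

clients = pd.DataFrame({
    "Name": ["Emily", "Lucas", "Camile"],
    "Degree": ["Masters", "Master", "Bachelor"],
})

# Only values in the "Degree" column are considered for replacement
normalized = clients.replace({"Degree": {"Masters": "MS", "Master": "MS"}})
print(normalized)
```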


### Compare two DataFrame and get their difference
Use the **.compare**() function
1. It generates a data frame showing the columns with differences side by side. Its shape is (0, 0) only if the two data frames being compared are identical.
2. If you want to show values that are equal, set the **keep_equal** parameter to True; otherwise, they are shown as NaN.

In [237]:
student_df = pd.DataFrame(client_info)

In [238]:
# Create a second dataframe by changing the "Name" & "Age" columns
# (the new values must differ from the originals, otherwise compare() returns an empty result)
student_df_test = student_df.copy()
student_df_test.loc[0, 'Name'] = 'Emma'
student_df_test.loc[2, 'Age'] = 20.0

In [239]:
# Compare the two dataframes : student_df and student_df_test
## Comparison showing only unmatching values
student_df.compare(student_df_test)

In [240]:
## Comparison including similar values
student_df.compare(student_df_test, keep_equal=True)
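
The behaviour of `compare` can be verified on a small case. This is a sketch with made-up values: one copy is left untouched and another is changed in two different cells.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({
    "Name": ["Emily", "Lucas", "Camile"],
    "Age": [23.0, np.nan, 19.0],
})
identical = original.copy()
modified = original.copy()
modified.loc[0, "Name"] = "Emma"   # illustrative change
modified.loc[2, "Age"] = 20.0      # illustrative change

no_diff = original.compare(identical)
diff = original.compare(modified)
diff_keep_equal = original.compare(modified, keep_equal=True)

print(no_diff.shape)
print(diff)
print(diff_keep_equal)
```

In `diff`, the cells that did not change inside the reported rows appear as NaN, while `diff_keep_equal` shows their actual values.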

### Get a subset of a very large dataset for quick analysis
Use the **nrows** parameter of the pandas **read_csv**() function to specify the number of rows you want to read

In [241]:
URL = "https://raw.githubusercontent.com/scpike/us-state-county-zip/master/geo-data.csv"

In [242]:
read_whole_data = pd.read_csv(URL)

In [243]:
read_whole_data.size

198618

In [244]:
# Read all the data in memory before getting the sample
sample_size = 400
sample_data = read_whole_data.head(sample_size)

In [245]:
sample_data.size

2400

In [246]:
# Read the sample on the fly
read_sample = pd.read_csv(URL, nrows=sample_size)

In [247]:
read_sample.size

2400
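
The same idea can be tried offline. In this sketch the CSV content is generated in memory (an assumption made so the example does not depend on the URL). Note that `.size` counts cells, that is, rows times columns.

```python
import io

import pandas as pd

# Made-up CSV with 1000 data rows and 2 columns
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(1000))

# Only the first 5 data rows are parsed
sample = pd.read_csv(io.StringIO(csv_text), nrows=5)

print(sample.shape)
print(sample.size)
```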

### Transform your DataFrame from a wide to a long format
1. Wide format is when you have a lot of columns
2. Long format, on the other hand, is when you have a lot of rows

Pandas **.melt**() is a perfect candidate for this task

In [248]:
candidates= {
    'Name':["Emily","Adrian","Gabriel","Cindy"],
    'ID': [1, 2, 3, 4],
    '2017':[85, 87, 89, 91],
    '2018':[96, 98, 100, 102],
    '2019':[100, 102, 106, 106],
    '2020':[89, 95, 98, 100],
    '2021':[94, 96, 98, 100],
    '2022':[100, 104, 104, 107],
          }

In [249]:
# Data in wide format
salary_data = pd.DataFrame(candidates)

In [250]:
salary_data

Unnamed: 0,Name,ID,2017,2018,2019,2020,2021,2022
0,Emily,1,85,96,100,89,94,100
1,Adrian,2,87,98,102,95,96,104
2,Gabriel,3,89,100,106,98,98,104
3,Cindy,4,91,102,106,100,100,107


In [251]:
# Transform into the long format
long_format_data = salary_data.melt(id_vars=['Name', 'ID'],
                                    var_name='Year',
                                    value_name='Salary(k$)')

In [252]:
long_format_data

Unnamed: 0,Name,ID,Year,Salary(k$)
0,Emily,1,2017,85
1,Adrian,2,2017,87
2,Gabriel,3,2017,89
3,Cindy,4,2017,91
4,Emily,1,2018,96
5,Adrian,2,2018,98
6,Gabriel,3,2018,100
7,Cindy,4,2018,102
8,Emily,1,2019,100
9,Adrian,2,2019,102


### Reduce the size of your Pandas DataFrame by ignoring the index

In [None]:
URL = "https://raw.githubusercontent.com/scpike/us-state-county-zip/master/geo-data.csv"
data = pd.read_csv(URL)

In [None]:
# Create large data by repeating each row 10000 times
large_data = data.loc[data.index.repeat(10000)]

In [None]:
# Save with INDEX
large_data.to_csv("large_data_with_index.csv")

In [None]:
# Check the size of the file
!ls -GFlash large_data_with_index.csv

In [None]:
# Save without INDEX
large_data.to_csv("large_data_without_index.csv", index=False)

In [None]:
# Check the size of the file
!ls -GFlash large_data_without_index.csv
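
The two file sizes can also be compared from Python, without relying on `ls`. This is a sketch with a made-up data frame written to a temporary directory; the file names are illustrative.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({
    "state": ["Alabama"] * 10_000,
    "zipcode": [35004] * 10_000,
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    frame.to_csv(with_index)                  # index written as an extra column
    frame.to_csv(without_index, index=False)  # index dropped

    size_with = os.path.getsize(with_index)
    size_without = os.path.getsize(without_index)

print(size_with, size_without)
```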

### Parquet instead of CSV
Parquet can be a better choice than CSV with respect to:
1. Processing speed
2. Speed in saving and loading
3. Disk space occupied by the data frame

In [None]:
URL = "https://raw.githubusercontent.com/scpike/us-state-county-zip/master/geo-data.csv"
df = pd.read_csv(URL)
# Create large data for experimentation by repeating each row 20000 times
exp_data = df.loc[df.index.repeat(20000)]

In [None]:
# Experiment with .csv format
# Write Time
# (%time is the line magic; %%time is only valid as the first line of a cell)
%time exp_data.to_csv("exp_data.csv", index=False)

# Read Time
%time csv_data = pd.read_csv("exp_data.csv")

# File Size
!ls -GFlash exp_data.csv

In [None]:
# Experiment with .parquet format
# Write Time
# (%time is the line magic; %%time is only valid as the first line of a cell)
%time exp_data.to_parquet('exp_data.parquet')

# Read Time
%time parquet_data = pd.read_parquet('exp_data.parquet')

# File Size
!ls -GFlash exp_data.parquet   

### Transform your data frame into a Markdown table
One way of doing that is to render it in a markdown format using the **.to_markdown**() function (it requires the **tabulate** package)

In [253]:
URL = "https://raw.githubusercontent.com/scpike/us-state-county-zip/master/geo-data.csv"
df = pd.read_csv(URL)
head_df = df.head()

In [254]:
print(head_df)

   state_fips    state state_abbr zipcode      county        city
0           1  Alabama         AL   35004   St. Clair       Acmar
1           1  Alabama         AL   35005   Jefferson  Adamsville
2           1  Alabama         AL   35006   Jefferson       Adger
3           1  Alabama         AL   35007      Shelby    Keystone
4           1  Alabama         AL   35010  Tallapoosa    New site


In [255]:
print(head_df.to_markdown(tablefmt="grid"))

+----+--------------+---------+--------------+-----------+------------+------------+
|    |   state_fips | state   | state_abbr   |   zipcode | county     | city       |
+====+==============+=========+==============+===========+============+============+
|  0 |            1 | Alabama | AL           |     35004 | St. Clair  | Acmar      |
+----+--------------+---------+--------------+-----------+------------+------------+
|  1 |            1 | Alabama | AL           |     35005 | Jefferson  | Adamsville |
+----+--------------+---------+--------------+-----------+------------+------------+
|  2 |            1 | Alabama | AL           |     35006 | Jefferson  | Adger      |
+----+--------------+---------+--------------+-----------+------------+------------+
|  3 |            1 | Alabama | AL           |     35007 | Shelby     | Keystone   |
+----+--------------+---------+--------------+-----------+------------+------------+
|  4 |            1 | Alabama | AL           |     35010 | Tallapoosa | New site   |
+----+--------------+---------+--------------+-----------+------------+------------+

### Format Date Time Column
Specify the target column in the **parse_dates** argument to get the correct column type

In [256]:
df_info = {
    'Name':["Emily","Lucas","Camile","Gabriel","Steven"],
    'Degree':['Master','Master','Bachelor','PhD','Master'],
    'From':['Arizona','Chicago','San Franciso','New York','Ohio'],
    'From_office(min)':[120,80,65,100,30],
    'application_date':[
                    '17/11/2022',
                    '23/09/2022',
                    '02/12/2021',
                    '25/08/2022',
                    '07/01/2022'
    ]
}

In [257]:
student_df = pd.DataFrame(df_info)

In [258]:
student_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Name              5 non-null      object
 1   Degree            5 non-null      object
 2   From              5 non-null      object
 3   From_office(min)  5 non-null      int64 
 4   application_date  5 non-null      object
dtypes: int64(1), object(4)
memory usage: 328.0+ bytes


In [259]:
student_df.to_csv("student.csv")

In [260]:
student_df = pd.read_csv("student.csv",
                          parse_dates=["application_date"])

In [261]:
student_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Unnamed: 0        5 non-null      int64         
 1   Name              5 non-null      object        
 2   Degree            5 non-null      object        
 3   From              5 non-null      object        
 4   From_office(min)  5 non-null      int64         
 5   application_date  5 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 368.0+ bytes
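
The dates in this example are written day first (dd/mm/yyyy). When a value such as 02/12/2021 is ambiguous, passing `dayfirst=True` alongside `parse_dates` states the intended reading explicitly. This is a sketch with an in-memory CSV introduced only for illustration:

```python
import io

import pandas as pd

csv_text = (
    "Name,application_date\n"
    "Camile,02/12/2021\n"
    "Emily,17/11/2022\n"
)

dates_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["application_date"],
    dayfirst=True,  # read 02/12/2021 as 2 December, not 12 February
)

print(dates_df.dtypes)
print(dates_df["application_date"].dt.month.tolist())
```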
