# <span style="color:#873600; font-family: Trebuchet MS; font-size: 80px; font-weight: bold;">Feature Engineering</span>

    * Feature engineering is an ML technique that uses existing data to create new variables that aren’t in the original training set.
    * The goal is to simplify and speed up data transformations while also improving model accuracy.
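
As a minimal sketch of the idea, the snippet below derives a new variable from two existing ones. The DataFrame, its column names (`area`, `price`, `price_per_area`) and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical housing data (values invented for illustration)
houses = pd.DataFrame({
    'area': [100, 150, 200],
    'price': [200000, 270000, 380000]
})

# New feature that is not among the original columns: price per unit of area
houses['price_per_area'] = houses['price'] / houses['area']

print(houses)
```

A ratio like this can expose a relationship that the raw columns only encode indirectly.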

# <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Feature Engineering Techniques for Machine Learning
</span>

     1. Imputation (Numerical and Categorical)
     2. Handling Outliers (Removal, Replacing values, Capping, Discretization)
     3. Log Transform
     4. One-hot encoding
     5. Scaling (Normalization, Standardization)
     6. Binning
      etc.

 <span style="color:#873600; font-family: Trebuchet MS, sans-serif; font-size: 50px; font-weight: bold;">Imputation
</span>

 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Dropping with threshold</span>

In [1]:
import pandas as pd

# Create a sample housing dataset with missing values
housing_data = pd.DataFrame({
    'area': [100, 150, None, 200, None, 180],
    'bedrooms': [3, 4, None, 2, 4, 3],
    'bathrooms': [7, None, None, None, None, None],
    'price': [None, None, None, 180000, 350000, 280000]
})

# Display the original housing_data DataFrame
print("Original DataFrame:")
print(housing_data)

# Set the threshold for missing value rate
threshold = 0.7

# Dropping columns with missing value rate higher than threshold
housing_data = housing_data[housing_data.columns[housing_data.isnull().mean() < threshold]]

# Dropping rows with missing value rate higher than threshold
housing_data = housing_data.loc[housing_data.isnull().mean(axis=1) < threshold]

# Display the updated housing_data DataFrame
print("\nUpdated DataFrame:")
print(housing_data)

Original DataFrame:
    area  bedrooms  bathrooms     price
0  100.0       3.0        7.0       NaN
1  150.0       4.0        NaN       NaN
2    NaN       NaN        NaN       NaN
3  200.0       2.0        NaN  180000.0
4    NaN       4.0        NaN  350000.0
5  180.0       3.0        NaN  280000.0

Updated DataFrame:
    area  bedrooms     price
0  100.0       3.0       NaN
1  150.0       4.0       NaN
3  200.0       2.0  180000.0
4    NaN       4.0  350000.0
5  180.0       3.0  280000.0


 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Numerical Imputation</span>

    * Imputation is often preferable to dropping because it preserves the size of the dataset.
    * I think the best way to impute is to use the column medians, because they are more robust to outliers than the means.
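
To illustrate why the median is less sensitive to outliers, here is a small sketch comparing it with the mean on a column that contains one extreme value. The `incomes` series is hypothetical and exists only for this comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme value and one missing entry
incomes = pd.Series([30, 32, 35, 38, 1000, np.nan])

mean_value = incomes.mean()      # pulled upward by the outlier
median_value = incomes.median()  # stays near the typical values

print("Mean:", mean_value)
print("Median:", median_value)

# Impute the missing entry with the median
incomes_filled = incomes.fillna(median_value)
print(incomes_filled)
```

Here the mean is 227.0 while the median is 35.0, so the median-imputed value looks like a typical row rather than an artifact of the outlier.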

In [35]:
import pandas as pd
import numpy as np

# Create a sample scores dataset with missing values
scores_data = pd.DataFrame({
    'Math': [80, 90, np.nan, 70, 85],
    'Science': [75, np.nan, 85, 90, 80],
    'English': [np.nan, 75, 80, 85, np.nan]
})

# Display the original scores_data DataFrame
print("Original DataFrame:")
print(scores_data)

# Fill missing values with 0
scores_data = scores_data.fillna(0)

# Display the DataFrame after filling missing values with 0
print("\nDataFrame after filling missing values with 0:")
print(scores_data)

Original DataFrame:
   Math  Science  English
0  80.0     75.0      NaN
1  90.0      NaN     75.0
2   NaN     85.0     80.0
3  70.0     90.0     85.0
4  85.0     80.0      NaN

DataFrame after filling missing values with 0:
   Math  Science  English
0  80.0     75.0      0.0
1  90.0      0.0     75.0
2   0.0     85.0     80.0
3  70.0     90.0     85.0
4  85.0     80.0      0.0


In [66]:
# Create a sample scores dataset with missing values
scores_data = pd.DataFrame({
    'Math': [80, 90, np.nan, 70, 85],
    'Science': [75, np.nan, 85, 90, 80],
    'English': [np.nan, 75, 80, 85, np.nan]
})

# Display the original scores_data DataFrame
print("Original DataFrame:")
print(scores_data)

# Fill missing values with median of the columns
scores_data = scores_data.fillna(scores_data.median())

# Display the final DataFrame after filling missing values with medians
print("\nDataFrame after filling missing values with median:")
print(scores_data)

Original DataFrame:
   Math  Science  English
0  80.0     75.0      NaN
1  90.0      NaN     75.0
2   NaN     85.0     80.0
3  70.0     90.0     85.0
4  85.0     80.0      NaN

DataFrame after filling missing values with median:
   Math  Science  English
0  80.0     75.0     80.0
1  90.0     82.5     75.0
2  82.5     85.0     80.0
3  70.0     90.0     85.0
4  85.0     80.0     80.0


 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Numerical Imputation with sklearn</span>

In [67]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Create a sample scores dataset with missing values
scores_data = pd.DataFrame({
    'Math': [80, 90, np.nan, 70, 85],
    'Science': [75, np.nan, 85, 90, 80],
    'English': [np.nan, 75, 80, 85, np.nan]
})

# Display the original scores_data DataFrame
print("Original DataFrame:")
print(scores_data)

# Create an instance of SimpleImputer with the "median" strategy
imputer = SimpleImputer(strategy='median')

# Fill missing values in the scores_data DataFrame using the imputer
scores_data = pd.DataFrame(imputer.fit_transform(scores_data), columns=scores_data.columns)

# Display the final DataFrame after filling missing values
print("\nDataFrame after filling missing values with median:")
print(scores_data)

Original DataFrame:
   Math  Science  English
0  80.0     75.0      NaN
1  90.0      NaN     75.0
2   NaN     85.0     80.0
3  70.0     90.0     85.0
4  85.0     80.0      NaN

DataFrame after filling missing values with median:
   Math  Science  English
0  80.0     75.0     80.0
1  90.0     82.5     75.0
2  82.5     85.0     80.0
3  70.0     90.0     85.0
4  85.0     80.0     80.0


 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Categorical Imputation</span>

    * Replacing the missing values with the mode (the most frequent value) of a column is a good option for handling categorical columns.

In [45]:
import pandas as pd

# Create a sample feedback dataset with missing values
feedback_data = pd.DataFrame({
    'Sentiment': ['Positive', 'Negative', 'Positive', None, 'Positive', None, 'Neutral', 'Positive']
})

# Display the original feedback_data DataFrame
print("Original DataFrame:")
print(feedback_data)

# Fill missing values with the most frequent sentiment (assign back instead of chained inplace)
feedback_data['Sentiment'] = feedback_data['Sentiment'].fillna(feedback_data['Sentiment'].value_counts().idxmax())

# Display the DataFrame after filling missing values
print("\nDataFrame after filling missing values:")
print(feedback_data)

Original DataFrame:
  Sentiment
0  Positive
1  Negative
2  Positive
3      None
4  Positive
5      None
6   Neutral
7  Positive

DataFrame after filling missing values:
  Sentiment
0  Positive
1  Negative
2  Positive
3  Positive
4  Positive
5  Positive
6   Neutral
7  Positive


 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Categorical Imputation with sklearn</span>

In [57]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Create a sample feedback dataset with missing values
feedback_data = pd.DataFrame({
    'Sentiment': ['Positive', 'Negative', 'Positive', None, 'Positive', None, 'Neutral', 'Positive']
})

# Display the original feedback_data DataFrame
print("Original DataFrame:")
print(feedback_data)

# Create an instance of SimpleImputer with the "most_frequent" strategy
imputer = SimpleImputer(missing_values=None, strategy='most_frequent')

# Fill missing values in the "Sentiment" column using the imputer (flatten the 2-D result to 1-D)
feedback_data['Sentiment'] = imputer.fit_transform(feedback_data[['Sentiment']]).ravel()

# Display the DataFrame after filling missing values
print("\nDataFrame after filling missing values:")
print(feedback_data)

Original DataFrame:
  Sentiment
0  Positive
1  Negative
2  Positive
3      None
4  Positive
5      None
6   Neutral
7  Positive

DataFrame after filling missing values:
  Sentiment
0  Positive
1  Negative
2  Positive
3  Positive
4  Positive
5  Positive
6   Neutral
7  Positive


 <span style="color:#873600; font-family: Trebuchet MS, sans-serif; font-size: 50px; font-weight: bold;">Binning
</span>

    * The main motivation for binning is to make the model more robust and to reduce overfitting.
    * However, it comes at a cost: binning discards information, which can hurt model performance.
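
To make the trade-off concrete, the sketch below bins a numeric column and compares the number of distinct values before and after. The `ages` data, the bin edges and the labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ages (invented for illustration)
ages = pd.Series([22, 25, 31, 38, 45, 52, 67, 71])

# Assumed bin edges; intervals are right-inclusive by default, e.g. (30, 60]
age_group = pd.cut(ages, bins=[0, 30, 60, 100], labels=['Young', 'Middle', 'Senior'])

print(age_group.value_counts(sort=False))

# Fewer distinct values: less noise for the model to fit, but also less detail
print("Distinct values before:", ages.nunique())
print("Distinct values after:", age_group.nunique())
```

After binning, a 31-year-old and a 52-year-old are indistinguishable to the model, which is exactly the information loss mentioned above.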

 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Numerical Binning</span>

In [71]:
import pandas as pd

# Create a sample DataFrame
data = pd.DataFrame({
    'value': [15, 50, 80, 25, 90, 35]
})

# Display the original DataFrame
print("Original DataFrame:")
print(data)

# Create a new column 'bin' based on value ranges
data['bin'] = pd.cut(data['value'], bins=[0, 30, 70, 100], labels=["Low", "Mid", "High"])

# Display the DataFrame with the new 'bin' column
print("\nDataFrame with 'bin' column:")
print(data)

Original DataFrame:
   value
0     15
1     50
2     80
3     25
4     90
5     35

DataFrame with 'bin' column:
   value   bin
0     15   Low
1     50   Mid
2     80  High
3     25   Low
4     90  High
5     35   Mid


 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Bin continuous data into intervals with sklearn</span>


In [72]:
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Create a sample DataFrame
data = pd.DataFrame({
    'value': [15, 50, 80, 25, 90, 35]
})

# Display the original DataFrame
print("Original DataFrame:")
print(data)

# Create an instance of KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')

# Fit and transform the 'value' column to create bins
bins = discretizer.fit_transform(data[['value']])

# Map bin indices to labels manually (flatten the 2-D result to 1-D integers first)
labels = ["Low", "Mid", "High"]
data['bin'] = [labels[bin_index] for bin_index in bins.ravel().astype(int)]

# Display the DataFrame with the new 'bin' column
print("\nDataFrame with 'bin' column:")
print(data)

Original DataFrame:
   value
0     15
1     50
2     80
3     25
4     90
5     35

DataFrame with 'bin' column:
   value   bin
0     15   Low
1     50   Mid
2     80  High
3     25   Low
4     90  High
5     35   Low


 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Binarize data according to a threshold with sklearn</span>


In [79]:
import pandas as pd
from sklearn.preprocessing import Binarizer

# Create a sample DataFrame
data = pd.DataFrame({
    'value': [15, 50, 80, 25, 90, 35, 67]
})

# Display the original DataFrame
print("Original DataFrame:")
print(data)

# Create an instance of Binarizer: values greater than 50 map to 1, all others to 0
binarizer = Binarizer(threshold=50)

# Binarize the 'value' column (Binarizer is stateless, so fitting learns nothing,
# but fit_transform follows the usual estimator workflow)
binarized_values = binarizer.fit_transform(data[['value']])

# Create a new column 'bin' with the binarized values (flattened to 1-D)
data['bin'] = binarized_values.ravel()

# Display the DataFrame with the new 'bin' column
print("\nDataFrame with 'bin' column:")
print(data)

Original DataFrame:
   value
0     15
1     50
2     80
3     25
4     90
5     35
6     67

DataFrame with 'bin' column:
   value  bin
0     15    0
1     50    0
2     80    1
3     25    0
4     90    1
5     35    0
6     67    1


 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Categorical Binning</span>

In [76]:
import pandas as pd
import numpy as np


# Create a sample DataFrame
data = pd.DataFrame({
    'Country': ['Spain', 'Chile', 'Australia', 'Italy', 'Brazil']
})

# Display the original DataFrame
print("Original DataFrame:")
print(data)

# Define the conditions for each continent
conditions = [
    data['Country'].str.contains('Spain'),
    data['Country'].str.contains('Italy'),
    data['Country'].str.contains('Chile'),
    data['Country'].str.contains('Brazil')
]

# Define the corresponding choices for each condition
choices = ['Europe', 'Europe', 'South America', 'South America']

# Assign the continent based on the conditions and choices
data['Continent'] = np.select(conditions, choices, default='Other')

# Display the DataFrame with the new 'Continent' column
print("\nDataFrame with 'Continent' column:")
print(data)

Original DataFrame:
     Country
0      Spain
1      Chile
2  Australia
3      Italy
4     Brazil

DataFrame with 'Continent' column:
     Country      Continent
0      Spain         Europe
1      Chile  South America
2  Australia          Other
3      Italy         Europe
4     Brazil  South America


 <span style="color:#873600; font-family: Trebuchet MS, sans-serif; font-size: 50px; font-weight: bold;">Log Transform
</span>

    * It helps to handle skewed data; after the transformation, the distribution is closer to normal.
    * It also reduces the effect of outliers.
    * The data must contain only positive values; for zero or negative inputs, np.log returns -inf or NaN (with a runtime warning) rather than a usable value.
    * You can add 1 to your data before transforming it, so that zeros map to 0.
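
As a rough illustration of the points above, the sketch below applies `np.log1p` (which computes log(1 + x)) to a right-skewed column and compares the skewness before and after. The `values` series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, non-negative values (invented for illustration)
values = pd.Series([0, 1, 2, 3, 5, 8, 13, 400])

# np.log1p(x) computes log(1 + x), so zeros map to 0 instead of -inf
log_values = np.log1p(values)

print("Skewness before:", values.skew())
print("Skewness after:", log_values.skew())
```

The large value still stands out after the transform, but it dominates the distribution far less than before.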

In [102]:
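import numpy as np
import pandas as pd

# Log transform example
data = pd.DataFrame({'value': [2, 45, -23, 85, 28, 2, 35, -12]})
# ln(x+1) is undefined for x <= -1, so those rows become NaN
data['ln(x+1)'] = (data['value'] + 1).transform(np.log)
# Handling negative values: shift the data so the minimum maps to 1 before taking the log
# Note that the resulting values differ from ln(x+1)
data['ln(x-min(x)+1)'] = (data['value'] - data['value'].min() + 1).transform(np.log)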

In [103]:
data

   value   ln(x+1)  ln(x-min(x)+1)
0      2  1.098612        3.258097
1     45  3.828641        4.234107
2    -23       NaN        0.000000
3     85  4.454347        4.691348
4     28  3.367296        3.951244
5      2  1.098612        3.258097
6     35  3.583519        4.077537
7    -12       NaN        2.484907


 <span style="color:#873600; font-family: Trebuchet MS, sans-serif; font-size: 50px; font-weight: bold;">Label Encoding
</span>

    * Label encoding converts each category of a categorical variable into an integer code.
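
The sketch below shows how the integer codes relate to the original labels. It uses a hypothetical `sizes` list; `classes_` and `inverse_transform` are part of scikit-learn's `LabelEncoder`. Note that the codes follow the sorted order of the labels, so they do not reflect any natural ordering such as small < medium < large.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values (invented for illustration)
sizes = ['small', 'large', 'medium', 'small']

encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

# classes_ holds the sorted unique labels; each code is an index into it
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
print(list(encoder.inverse_transform(codes)))
```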
    

In [1]:
from sklearn.preprocessing import LabelEncoder

# Sample data
categories = ['red', 'blue', 'green', 'yellow', 'red', 'blue']

# Initialize LabelEncoder
encoder = LabelEncoder()

# Fit and transform the categories
encoded_categories = encoder.fit_transform(categories)

# Print the encoded categories
print(encoded_categories)

[2 0 1 3 2 0]


In [2]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample DataFrame
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'yellow', 'red', 'blue'],
    'size': ['small', 'medium', 'large', 'large', 'medium', 'small'],
    'shape': ['circle', 'square', 'circle', 'square', 'circle', 'square']
})

# Initialize LabelEncoder
encoder = LabelEncoder()

# Iterate over each column in the DataFrame
for column in data.columns:
    # Check if the column data type is object (categorical)
    if data[column].dtype == 'object':
        # Fit and transform the column
        data[column] = encoder.fit_transform(data[column])

# Print the modified DataFrame
print(data)

   color  size  shape
0      2     2      0
1      0     1      1
2      1     0      0
3      3     0      1
4      2     1      0
5      0     2      1


 <span style="color:#873600; font-family: Trebuchet MS, sans-serif; font-size: 50px; font-weight: bold;">One Hot Encoding
</span>

    * This method spreads the values of a column across multiple flag columns, one per category, and assigns 0 or 1 to each.

In [3]:
import pandas as pd

# Sample DataFrame
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'yellow', 'red', 'blue'],
    'size': ['small', 'medium', 'large', 'large', 'medium', 'small'],
    'shape': ['circle', 'square', 'circle', 'square', 'circle', 'square']
})

# Perform one-hot encoding using get_dummies function
encoded_data = pd.get_dummies(data)

# Print the encoded DataFrame
print(encoded_data)

   color_blue  color_green  color_red  color_yellow  size_large  size_medium  \
0           0            0          1             0           0            0   
1           1            0          0             0           0            1   
2           0            1          0             0           1            0   
3           0            0          0             1           1            0   
4           0            0          1             0           0            1   
5           1            0          0             0           0            0   

   size_small  shape_circle  shape_square  
0           1             1             0  
1           0             0             1  
2           0             1             0  
3           0             0             1  
4           0             1             0  
5           1             0             1  


In [105]:
import pandas as pd

# Example dataset
data = pd.DataFrame({'column': ['A', 'B', 'C', 'A', 'B', 'C']})

# Perform one-hot encoding
encoded_columns = pd.get_dummies(data['column'])

# Join the encoded columns back to the original data and drop the original column
data = data.join(encoded_columns).drop('column', axis=1)

print(data)

   A  B  C
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  0  1  0
5  0  0  1


In [4]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample DataFrame
data = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'yellow', 'red', 'blue'],
    'size': ['small', 'medium', 'large', 'large', 'medium', 'small'],
    'shape': ['circle', 'square', 'circle', 'square', 'circle', 'square']
})

# Initialize OneHotEncoder
encoder = OneHotEncoder()

# Perform one-hot encoding
encoded_data = encoder.fit_transform(data)

# Create a new DataFrame with the encoded data
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(data.columns))

# Print the encoded DataFrame
print(encoded_df)

   color_blue  color_green  color_red  color_yellow  size_large  size_medium  \
0         0.0          0.0        1.0           0.0         0.0          0.0   
1         1.0          0.0        0.0           0.0         0.0          1.0   
2         0.0          1.0        0.0           0.0         1.0          0.0   
3         0.0          0.0        0.0           1.0         1.0          0.0   
4         0.0          0.0        1.0           0.0         0.0          1.0   
5         1.0          0.0        0.0           0.0         0.0          0.0   

   size_small  shape_circle  shape_square  
0         1.0           1.0           0.0  
1         0.0           0.0           1.0  
2         0.0           1.0           0.0  
3         0.0           0.0           1.0  
4         0.0           1.0           0.0  
5         1.0           0.0           1.0  


In [3]:
# Sample DataFrame
data = {
    'Color': ['Red', 'Green', 'Blue', 'Blue', 'Red'],
    'Size': ['S', 'M', 'L', 'S', 'M']
}

df_1 = pd.DataFrame(data)

# Applying one-hot encoding with drop_first=True
encoded_df = pd.get_dummies(df_1, drop_first=True)

print("Original DataFrame:")
print(df_1)

print("\nOne-hot Encoded DataFrame:")
print(encoded_df)

Original DataFrame:
   Color Size
0    Red    S
1  Green    M
2   Blue    L
3   Blue    S
4    Red    M

One-hot Encoded DataFrame:
   Color_Green  Color_Red  Size_M  Size_S
0            0          1       0       1
1            1          0       1       0
2            0          0       0       0
3            0          0       0       1
4            0          1       1       0


 <span style="color:#873600; font-family: Trebuchet MS, sans-serif; font-size: 50px; font-weight: bold;">Feature Split
</span>

In [5]:
import pandas as pd

# Sample DataFrame
data = pd.DataFrame({
    'name': ['Luther N. Gonzalez', 'Charles M. Young', 'Terry Lawson', 'Kristen White', 'Thomas Logsdon']
})

# Extracting first names
data['first_name'] = data['name'].str.split(" ").map(lambda x: x[0])

# Extracting last names
data['last_name'] = data['name'].str.split(" ").map(lambda x: x[-1])

# Print the updated DataFrame
print(data)

                 name first_name last_name
0  Luther N. Gonzalez     Luther  Gonzalez
1    Charles M. Young    Charles     Young
2        Terry Lawson      Terry    Lawson
3       Kristen White    Kristen     White
4      Thomas Logsdon     Thomas   Logsdon


In [22]:
import pandas as pd

# Sample DataFrame
data = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)', 'Waiting to Exhale (1995)']
})

# Extracting the year from the title
data['year'] = data['title'].str.split("(", n=1, expand=True)[1].str.split(")", n=1, expand=True)[0]

# Changing the data type to integer
data['year'] = data['year'].astype(int)

# Print the updated DataFrame with the corresponding data type
print(data.dtypes)
print(data)

title    object
year      int32
dtype: object
                      title  year
0          Toy Story (1995)  1995
1            Jumanji (1995)  1995
2   Grumpier Old Men (1995)  1995
3  Waiting to Exhale (1995)  1995


 <span style="color:#873600; font-family: Trebuchet MS, sans-serif; font-size: 50px; font-weight: bold;">Scaling
</span>

 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Normalization</span>
 
    * Normalization (or min-max normalization) scales all values into a fixed range between 0 and 1.
    * It is recommended to handle outliers before normalization, because the minimum and maximum are sensitive to them.

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
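
As a quick check of the min-max formula, the following sketch applies it by hand with pandas: the minimum is subtracted first, and the result is divided by the range. The series `x` is a hypothetical example.

```python
import pandas as pd

# Hypothetical feature values (invented for illustration)
x = pd.Series([10, 20, 30, 40, 50])

# Min-max normalization: subtract the minimum, then divide by the range
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm.tolist())
```

The smallest value maps to 0 and the largest to 1, with everything else spaced proportionally in between.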

In [8]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample DataFrame
data = pd.DataFrame({
    'feature1': [10, 20, 30, 40, 50],
    'feature2': [5, 15, 25, 35, 45]
})

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Perform normalization
normalized_data = scaler.fit_transform(data)

# Create a new DataFrame with the normalized data
normalized_df = pd.DataFrame(normalized_data, columns=data.columns)

# Print the normalized DataFrame
print(normalized_df)

   feature1  feature2
0      0.00      0.00
1      0.25      0.25
2      0.50      0.50
3      0.75      0.75
4      1.00      1.00


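 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Standardization</span>

    * Standardization (or z-score normalization) rescales each feature to zero mean and unit variance.

In [9]: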
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample DataFrame
data = pd.DataFrame({
    'feature1': [10, 20, 30, 40, 50],
    'feature2': [5, 15, 25, 35, 45]
})

# Initialize StandardScaler
scaler = StandardScaler()

# Perform standardization
standardized_data = scaler.fit_transform(data)

# Create a new DataFrame with the standardized data
standardized_df = pd.DataFrame(standardized_data, columns=data.columns)

# Print the standardized DataFrame
print(standardized_df)

   feature1  feature2
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2  0.000000  0.000000
3  0.707107  0.707107
4  1.414214  1.414214


 <span style="color:#A04000; font-family: Trebuchet MS, sans-serif; font-size: 30px; font-weight: bold;">Robust Scaler</span>
 
    * The RobustScaler uses statistics that are robust to outliers.
    * It subtracts the median and scales the data by a quantile range (by default the interquartile range, IQR).
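
To show what this means in practice, the sketch below computes the scaling by hand as (x - median) / IQR and compares it with `RobustScaler` under its default settings. The DataFrame `df_outlier` is hypothetical and contains one deliberate outlier.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one outlier (invented for illustration)
df_outlier = pd.DataFrame({'feature': [10, 20, 30, 40, 1000]})

# Manual computation: (x - median) / (Q3 - Q1)
median = df_outlier['feature'].median()
iqr = df_outlier['feature'].quantile(0.75) - df_outlier['feature'].quantile(0.25)
manual = (df_outlier['feature'] - median) / iqr

# RobustScaler with default settings (centering on, quantile_range=(25.0, 75.0))
scaled = RobustScaler().fit_transform(df_outlier).ravel()

print(manual.tolist())
print(np.allclose(manual, scaled))
```

The four typical values land between -1 and 0.5 as if the outlier were absent, because the median and IQR barely react to it.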

In [11]:
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Sample DataFrame
data = pd.DataFrame({
    'feature1': [10, 20, 30, 40, 50],
    'feature2': [5, 15, 25, 35, 45]
})

# Initialize RobustScaler
scaler = RobustScaler()

# Perform feature scaling
scaled_data = scaler.fit_transform(data)

# Create a new DataFrame with the scaled data
scaled_df = pd.DataFrame(scaled_data, columns=data.columns)

# Print the scaled DataFrame
print(scaled_df)

   feature1  feature2
0      -1.0      -1.0
1      -0.5      -0.5
2       0.0       0.0
3       0.5       0.5
4       1.0       1.0


 <span style="color:#873600; font-family: Trebuchet MS, sans-serif; font-size: 50px; font-weight: bold;">Extracting Date
</span>

    * Extracting the parts of the date into different columns: year, month, day, etc.
    * Extracting the time elapsed between the current date and the date column in terms of years, months, days, etc.
    * Extracting specific features from the date: name of the weekday, weekend or not, holiday or not, etc.
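
A minimal sketch of the "weekend or not" and elapsed-time ideas is shown below. The `events` DataFrame and the fixed `reference` date are assumptions made so that the result is reproducible; in practice the current date could be used instead.

```python
import pandas as pd

# Hypothetical dates (invented for illustration)
events = pd.DataFrame({'date': pd.to_datetime(['2023-07-01', '2023-07-03', '2023-07-09'])})

# Day of the month
events['day'] = events['date'].dt.day

# dt.dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
events['is_weekend'] = events['date'].dt.dayofweek >= 5

# Days elapsed relative to a fixed reference date (assumed here for reproducibility)
reference = pd.Timestamp('2023-07-31')
events['days_before_reference'] = (reference - events['date']).dt.days

print(events)
```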

In [17]:
from datetime import date
import pandas as pd

data = pd.DataFrame({'date':
['01-01-2017',
'04-12-2008',
'23-06-1988',
'25-08-1999',
'20-02-1993',
]})

#Transform string to date
data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")

#Extracting Year
data['year'] = data['date'].dt.year

#Extracting Month
data['month'] = data['date'].dt.month

#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year

#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month

#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()

In [18]:
data

        date  year  month  passed_years  passed_months   day_name
0 2017-01-01  2017      1             6             78     Sunday
1 2008-12-04  2008     12            15            175   Thursday
2 1988-06-23  1988      6            35            421   Thursday
3 1999-08-25  1999      8            24            287  Wednesday
4 1993-02-20  1993      2            30            365   Saturday


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           5 non-null      datetime64[ns]
 1   year           5 non-null      int64         
 2   month          5 non-null      int64         
 3   passed_years   5 non-null      int64         
 4   passed_months  5 non-null      int64         
 5   day_name       5 non-null      object        
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 368.0+ bytes


 <span style="color:#873600; font-family: Trebuchet MS, sans-serif; font-size: 50px; font-weight: bold;">Duplicates
</span>

In [19]:
import pandas as pd

# Sample DataFrame with duplicate values
data = pd.DataFrame({
    'ID': [1, 2, 3, 4, 1, 2],
    'Name': ['John', 'Alice', 'Bob', 'Alice', 'John', 'Bob'],
    'Age': [25, 30, 35, 30, 25, 35]
})

# Detect duplicate rows
duplicate_rows = data.duplicated()

# Print the duplicate rows
print("Duplicate rows:")
print(data[duplicate_rows])

# Drop duplicate rows
data_without_duplicates = data.drop_duplicates()

# Print the DataFrame without duplicates
print("DataFrame without duplicates:")
print(data_without_duplicates)

Duplicate rows:
   ID  Name  Age
4   1  John   25
DataFrame without duplicates:
   ID   Name  Age
0   1   John   25
1   2  Alice   30
2   3    Bob   35
3   4  Alice   30
5   2    Bob   35


In [21]:
data = pd.DataFrame({
    'ID': [1, 2, 3, 4, 1, 2],
    'Name': ['John', 'Alice', 'Bob', 'Alice', 'John', 'Bob'],
    'Age': [25, 30, 35, 30, 25, 35]
})
data.duplicated().sum()

1