##### Import Libraries
Importing libraries is a fundamental step in programming and data analysis.
Libraries are collections of pre-written code, functions, and modules that provide a wide range of functionality, so you do not have to write it yourself.


##### Most Used Libraries 
- NumPy: For numerical and mathematical operations, including arrays and matrices.
- pandas: For data manipulation and analysis.
- Matplotlib and Seaborn: For data visualization and plotting.
- scikit-learn: For machine learning and data mining.

##### import library_name as alias
- Import a library or module under an alias (a shorter name) for convenience. This is particularly useful for libraries or modules with long names, because it makes your code more concise and readable.

In [1]:
import numpy as np
import pandas as pd

#### 1- Read CSV File
- CSV (Comma-Separated Values): a text file in which the values (columns) of each row are separated by commas.
- Separator (sep or delimiter): set this when the file is separated by another delimiter (tab, semicolon, ...).

    Defines the character or sequence of characters used to separate values in the CSV file.
    ```
    df = pd.read_csv('data.tsv', sep='\t')
    ``` 
- Header Row (header):

    Specifies which row to use as the column names. 
    
    You can set it to None if there is no header row.

    Use the first row as column names
    ``` 
    df = pd.read_csv('data.csv', header=0)
    ``` 

    No header row, set column names manually
    ```
    df = pd.read_csv('data.csv', header=None, names=['A', 'B', 'C'])
    ```

- Index Column (index_col): a column that identifies each row, such as an ID

    Specifies the column to use as the DataFrame's index.

    It can be a column name or index position.
    ```
    df = pd.read_csv('data.csv', index_col='ID')
    ```
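The options above can be combined in a single call. The following is a minimal sketch that reads from an in-memory string instead of a file so that it runs anywhere; the data and the column names `ID`, `Name`, and `Age` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated content with no header row (not from the notebook).
raw = "1\tAnna\t29\n2\tBen\t41\n"

df_demo = pd.read_csv(
    io.StringIO(raw),             # any file-like object works in place of a path
    sep="\t",                     # values are separated by tabs, not commas
    header=None,                  # the data has no header row
    names=["ID", "Name", "Age"],  # so column names are supplied manually
    index_col="ID",               # use the ID column as the row index
)
print(df_demo)
```

Because `ID` becomes the index, only `Name` and `Age` remain as regular columns.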

In [179]:
# Read a CSV file from a local path. If the file is in the same folder as the notebook, the file name and its extension are enough.
df = pd.read_csv('train.csv')

#### 2- Data Exploration

    I. info() 
    method to get a concise summary of the DataFrame, including data types, non-null values, and memory usage.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


    II. shape
     attribute to get the number of rows and columns in the DataFrame.

In [6]:
# Get the number of rows and columns
num_rows, num_columns = df.shape
print("number of rows : " , num_rows)
print("number of columns : " , num_columns)


number of rows :  891
number of columns :  12


    III. columns 
    attribute to get the column names.

In [7]:
# Get column names
column_names = df.columns
print(column_names)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


    IV. Head and Tail
    head() to display the first few rows of the DataFrame 
    tail() to display the last few rows.

In [9]:
# Display the first 5 rows
df_head = df.head()
print(df_head)

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [10]:
# Display the last 3 rows
df_tail = df.tail(3)
print(df_tail)

     PassengerId  Survived  Pclass                                      Name  \
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex   Age  SibSp  Parch      Ticket   Fare Cabin Embarked  
888  female   NaN      1      2  W./C. 6607  23.45   NaN        S  
889    male  26.0      0      0      111369  30.00  C148        C  
890    male  32.0      0      0      370376   7.75   NaN        Q  


    V. Descriptive Statistics :  describe() 
    method to generate basic statistics for each numeric column in the DataFrame, such as count, mean, standard deviation, and quartiles.

In [11]:
# Generate descriptive statistics for numeric columns
stats = df.describe()
print(stats)

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


    VI. Data Types : dtypes 
    to view the data types of each column

In [12]:
# Get data types of columns
data_types = df.dtypes
data_types  # in a notebook, the value of the last expression in a cell is displayed automatically

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

    VII. Null Values
    .isna() or .isnull() to check for missing values in the DataFrame,
    and .sum() to count them.

In [14]:
# Check for missing values
missing_values_data = df.isna()
missing_values_data  # missing (null) values are shown as True


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [16]:
# Count missing values in each column
missing_values_Count = df.isna().sum()
missing_values_Count 


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [19]:
# Count missing values in a specific column
Cabin_Column_missing_values_Count = df["Cabin"].isna().sum()
Cabin_Column_missing_values_Count 


687
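Raw counts are easier to judge as a share of all rows. Since `isna()` returns booleans, its mean is the fraction of missing values. This is a small sketch on a made-up frame (the values are illustrative, not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the notebook itself uses train.csv.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_count = demo.isna().sum()          # number of missing values per column
missing_share = demo.isna().mean() * 100   # percentage of missing values per column

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary.sort_values("percent", ascending=False))
```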

    VIII. Unique Values
    nunique() to count the number of unique values in each column, 
    unique() to see the unique values.

In [20]:
# Count unique values in each column
unique_counts = df.nunique()
unique_counts



PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [25]:
# Count unique values in a specific column
unique_counts_column = df['Age'].nunique()
unique_counts_column

88

In [27]:
# unique values of column
unique_values = df["Sex"].unique()
unique_values



array(['male', 'female'], dtype=object)

    IX. Value Counts : value_counts()
     count the occurrences of unique values in a column.

In [28]:
# Count occurrences of unique values in a column
value_counts_sex = df['Sex'].value_counts()
value_counts_survived = df['Survived'].value_counts()

print(value_counts_sex)
print(value_counts_survived)


male      577
female    314
Name: Sex, dtype: int64
0    549
1    342
Name: Survived, dtype: int64
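`value_counts()` can also return proportions, and it can include missing values, which it skips by default. A short sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical values, including one missing entry.
demo = pd.Series(["S", "C", "S", None, "Q", "S"], name="Embarked")

counts = demo.value_counts()                    # missing values are excluded by default
shares = demo.value_counts(normalize=True)      # proportions of the non-missing values
with_missing = demo.value_counts(dropna=False)  # also counts the missing entry

print(counts)
print(shares)
print(with_missing)
```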


    X. Correlation : corr()
    to compute the pairwise correlation between numeric columns in the DataFrame.

In [29]:
# Calculate pairwise correlation between numeric columns
# (numeric_only=True skips text columns, which pandas 2.x would otherwise fail on)
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0


    XI. Measures of Central Tendency and Spread
    Central tendency: mean (average), median (middle value), mode (most frequent value).
    Spread: min, max, variance, standard deviation.
    

In [58]:
mean = df['Fare'].mean()
mean


32.204207968574636

In [59]:
median = df['Fare'].median()
median

14.4542

In [60]:
mode = df['Fare'].mode().values[0]
mode

8.05

In [62]:
# Named fare_min so that Python's built-in min() is not overwritten
fare_min = df['Fare'].min()
fare_min

0.0

In [63]:
# Named fare_max so that Python's built-in max() is not overwritten
fare_max = df['Fare'].max()
fare_max

512.3292

In [65]:
variance = df['Fare'].var()
variance

2469.436845743117

In [64]:
Standard_dev = df['Fare'].std()
Standard_dev

49.693428597180905
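The individual statistics above can also be computed in one call by passing a list of function names to `agg()`. A minimal sketch, using a handful of made-up fare values:

```python
import pandas as pd

# Hypothetical fare values for illustration.
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

# One Series indexed by the name of each statistic
summary = fare.agg(["mean", "median", "min", "max", "var", "std"])
print(summary)
```

The standard deviation is the square root of the variance, so `summary["std"] ** 2` matches `summary["var"]` up to rounding.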

    XII. Get Percentage

In [180]:
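# Select the 'Survived' values of female passengers
women = df.loc[df['Sex'] == 'female', 'Survived']
# Survived is 0 or 1, so the sum divided by the count is the fraction of survivors
rate_women = women.sum() / len(women)

print("Fraction of women who survived:", rate_women)

Fraction of women who survived: 0.7420382165605095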


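- A DataFrame column can be selected in two ways:
    - df.Sex --> df.ColumnName (attribute access; works only when the name is a valid Python identifier and does not clash with a DataFrame attribute or method)
    - df["Sex"] --> df["ColumnName"] (bracket access; works for any column name)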
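The difference between the two forms shows up with awkward column names. In this sketch the columns `Ticket Class` and `count` are invented to illustrate the two cases where attribute access does not work:

```python
import pandas as pd

# Hypothetical frame with a column name containing a space and one that
# collides with a DataFrame method.
demo = pd.DataFrame({
    "Sex": ["male", "female"],
    "Ticket Class": [3, 1],
    "count": [10, 20],
})

same = demo.Sex.equals(demo["Sex"])   # both forms return the same Series
ticket_class = demo["Ticket Class"]   # a name with a space needs brackets
count_column = demo["count"]          # brackets return the column
is_method = callable(demo.count)      # demo.count is the DataFrame.count method

print(same, is_method)
```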

#### 3- Grouping and Aggregation
Grouping and aggregation are essential techniques in data analysis that allow you to organize data into groups and calculate summary statistics or apply functions to each group.


    I. Grouping Data : groupby()
    You can use the groupby() method to group data based on one or more columns in your DataFrame.
    ** Grouping alone computes nothing; apply an aggregation afterwards to get results. **

In [34]:
# Group data by the 'Survived' column
grouped_by_Survived = df.groupby('Survived')
grouped_by_Survived

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002567C8B1610>

    You can also group by multiple columns to create a hierarchical grouping.

In [35]:
# Group data by multiple columns
multi_grouped = df.groupby(['Survived', 'Pclass'])
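A grouped object holds no results until something is requested from it. Two quick ways to inspect it are `size()` and `get_group()`. The sketch below uses a made-up five-row frame rather than train.csv:

```python
import pandas as pd

# Hypothetical miniature frame for illustration.
demo = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass":   [3, 1, 3, 3, 1],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 53.10],
})

grouped = demo.groupby(["Survived", "Pclass"])

group_sizes = grouped.size()                       # rows per (Survived, Pclass) pair
first_class_survivors = grouped.get_group((1, 1))  # the rows of one group

print(group_sizes)
print(first_class_survivors)
```

With several grouping columns, each group is identified by a tuple of values, here `(Survived, Pclass)`.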


    II. Aggregation Functions:
    Once you've grouped the data, you can apply aggregation functions to compute summary statistics for each group.

sum(): Calculate the sum of values in each group.

In [39]:
sum_by_group = grouped_by_Survived.sum(numeric_only=True)  # sum the numeric columns only
sum_by_group


Survived,PassengerId,Pclass,Age,SibSp,Parch,Fare
0,245412,1390,12985.5,304,181,12142.7199
1,151974,667,8219.67,162,159,16551.2294


In [40]:
# Can specify column
sum_by_group = grouped_by_Survived["Pclass"].sum()
sum_by_group

Survived
0    1390
1     667
Name: Pclass, dtype: int64

In [42]:
# get sum of Group data by multiple columns
sum_by_group = multi_grouped.sum(numeric_only=True)  # sum the numeric columns only
sum_by_group


Survived,Pclass,PassengerId,Age,SibSp,Parch,Fare
0,1,32824,2796.5,23,24,5174.7206
0,2,43856,3019.0,31,14,1882.9958
0,3,168732,7170.0,250,143,5085.0035
1,1,66881,4314.92,67,53,13002.6919
1,2,38200,2149.83,43,56,1918.8459
1,3,46893,1754.92,52,50,1629.6916


The same operations can be applied with other aggregation functions.

mean(): Calculate the mean (average) of values in each group.

In [45]:
mean_by_group = grouped_by_Survived.mean(numeric_only=True)  # text columns have no mean
print(mean_by_group)

          PassengerId    Pclass        Age     SibSp     Parch       Fare
Survived                                                                 
0          447.016393  2.531876  30.626179  0.553734  0.329690  22.117887
1          444.368421  1.950292  28.343690  0.473684  0.464912  48.395408



max() and min(): Find the maximum and minimum values in each group.

In [48]:
max_by_group = grouped_by_Survived["Fare"].max()
min_by_group = grouped_by_Survived["Fare"].min()
print(max_by_group ,"\n", min_by_group)

Survived
0    263.0000
1    512.3292
Name: Fare, dtype: float64 
 Survived
0    0.0
1    0.0
Name: Fare, dtype: float64


count(): Count the number of items in each group.

In [50]:
count_by_group = grouped_by_Survived.count()
count_by_group

Survived,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,549,549,549,549,424,549,549,549,549,68,549
1,342,342,342,342,290,342,342,342,342,136,340


    III. Combining GroupBy and Aggregation:
    You can perform both grouping and aggregation in a single step.

In [52]:
# Calculate the sum and mean for each group
result = df.groupby('Survived')['Pclass'].agg(['sum', 'mean'])
result

Survived,sum,mean
0,1390,2.531876
1,667,1.950292


In [53]:
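# Calculate the sum and mean of every numeric column for each group
# (text columns are excluded because they have no mean)
numeric_columns = df.select_dtypes(include='number').columns.drop('Survived').tolist()
result = df.groupby('Survived')[numeric_columns].agg(['sum', 'mean'])
result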


,PassengerId,PassengerId,Pclass,Pclass,Age,Age,SibSp,SibSp,Parch,Parch,Fare,Fare
Survived,sum,mean,sum,mean,sum,mean,sum,mean,sum,mean,sum,mean
0,245412,447.016393,1390,2.531876,12985.5,30.626179,304,0.553734,181,0.32969,12142.7199,22.117887
1,151974,444.368421,667,1.950292,8219.67,28.34369,162,0.473684,159,0.464912,16551.2294,48.395408


    IV. Aggregating Multiple Columns:
    You can apply aggregation to multiple columns simultaneously.

In [55]:
# Calculate the sum and mean
result = df.groupby('Survived')[['Pclass', 'Fare']].agg(['sum', 'mean'])
result

,Pclass,Pclass,Fare,Fare
Survived,sum,mean,sum,mean
0,1390,2.531876,12142.7199,22.117887
1,667,1.950292,16551.2294,48.395408


    V. Grouping Multiple Columns and Aggregating Multiple Columns:
    You can group by multiple columns and aggregate multiple columns at the same time.

In [57]:
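# Calculate the sum, mean and min of every numeric column for each (Survived, Pclass) group
# (text columns are excluded because they have no mean)
numeric_columns = df.select_dtypes(include='number').columns.drop(['Survived', 'Pclass']).tolist()
result = df.groupby(['Survived', 'Pclass'])[numeric_columns].agg(['sum', 'mean', 'min'])
result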


,,PassengerId,PassengerId,PassengerId,Age,Age,Age,SibSp,SibSp,SibSp,Parch,Parch,Parch,Fare,Fare,Fare
Survived,Pclass,sum,mean,min,sum,mean,min,sum,mean,min,sum,mean,min,sum,mean,min
0,1,32824,410.3,7,2796.5,43.695312,2.0,23,0.2875,0,24,0.3,0,5174.7206,64.684007,0.0
0,2,43856,452.123711,21,3019.0,33.544444,16.0,31,0.319588,0,14,0.14433,0,1882.9958,19.412328,0.0
0,3,168732,453.580645,1,7170.0,26.555556,1.0,250,0.672043,0,143,0.384409,0,5085.0035,13.669364,0.0
1,1,66881,491.772059,2,4314.92,35.368197,0.92,67,0.492647,0,53,0.389706,0,13002.6919,95.608029,25.9292
1,2,38200,439.08046,10,2149.83,25.901566,0.67,43,0.494253,0,56,0.643678,0,1918.8459,22.0557,10.5
1,3,46893,394.058824,3,1754.92,20.646118,0.42,52,0.436975,0,50,0.420168,0,1629.6916,13.694887,0.0
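When each column needs a different statistic, pandas also supports named aggregation, which produces flat, readable column names instead of a two-level header. This is a sketch on a made-up frame; the result names `passengers`, `mean_fare`, and `youngest` are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical miniature frame for illustration.
demo = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass":   [3, 1, 3, 3, 1],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 53.10],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0],
})

# Each keyword is an output column: name=(input column, aggregation function)
result = demo.groupby("Survived").agg(
    passengers=("Fare", "count"),
    mean_fare=("Fare", "mean"),
    youngest=("Age", "min"),
)
print(result)
```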


#### 4- Filtering data
Filtering the rows of a DataFrame based on specific values and conditions is a fundamental data manipulation task in data analysis.

    I. Filter Rows Based on a Single Condition
    You can filter rows based on a single condition by placing the condition (a boolean mask) inside square brackets.

In [69]:
# Filter rows where 'Age' is equal to 25
age_25 = df[df['Age'] == 25]
age_25

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
75,76,0,3,"Moen, Mr. Sigurd Hansen",male,25.0,0,0,348123,7.65,F G73,S
134,135,0,2,"Sobey, Mr. Samuel James Hayden",male,25.0,0,0,C.A. 29178,13.0,,S
246,247,0,3,"Lindahl, Miss. Agda Thorilda Viktoria",female,25.0,0,0,347071,7.775,,S
267,268,1,3,"Persson, Mr. Ernst Ulrik",male,25.0,1,0,347083,7.775,,S
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S
343,344,0,2,"Sedgwick, Mr. Charles Frederick Waddington",male,25.0,0,0,244361,13.0,,S
353,354,0,3,"Arnold-Franchi, Mr. Josef",male,25.0,1,0,349237,17.8,,S
370,371,1,1,"Harder, Mr. George Achilles",male,25.0,1,0,11765,55.4417,E50,C
442,443,0,3,"Petterson, Mr. Johan Emil",male,25.0,1,0,347076,7.775,,S
484,485,1,1,"Bishop, Mr. Dickinson H",male,25.0,1,0,11967,91.0792,B49,C


In [71]:
# Count the non-null values per column in the rows that satisfy the condition
# (any grouping or aggregation can also be applied to the filtered result)
age_25.count()
# Cabin is 4 rather than 23 because count() skips null values

PassengerId    23
Survived       23
Pclass         23
Name           23
Sex            23
Age            23
SibSp          23
Parch          23
Ticket         23
Fare           23
Cabin           4
Embarked       23
dtype: int64

    II. Filter Rows Based on Multiple Conditions:

    You can filter rows based on multiple conditions by combining them using logical operators like & (and) and | (or).

In [73]:
# Filter rows where 'Age' is 25 and 'Pclass' is 3
selected_rows = df[(df['Age'] == 25) & (df['Pclass'] == 3)]
selected_rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
75,76,0,3,"Moen, Mr. Sigurd Hansen",male,25.0,0,0,348123,7.65,F G73,S
246,247,0,3,"Lindahl, Miss. Agda Thorilda Viktoria",female,25.0,0,0,347071,7.775,,S
267,268,1,3,"Persson, Mr. Ernst Ulrik",male,25.0,1,0,347083,7.775,,S
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S
353,354,0,3,"Arnold-Franchi, Mr. Josef",male,25.0,1,0,349237,17.8,,S
442,443,0,3,"Petterson, Mr. Johan Emil",male,25.0,1,0,347076,7.775,,S
693,694,0,3,"Saad, Mr. Khalil",male,25.0,0,0,2672,7.225,,C
703,704,0,3,"Gallagher, Mr. Martin",male,25.0,0,0,36864,7.7417,,Q
729,730,0,3,"Ilmakangas, Miss. Pieta Sofia",female,25.0,1,0,STON/O2. 3101271,7.925,,S
784,785,0,3,"Ali, Mr. William",male,25.0,0,0,SOTON/O.Q. 3101312,7.05,,S


In [74]:
# Filter rows where 'Age' is 25 and ('Embarked' is 'C' or 'Q' )
selected_rows = df[(df['Age'] == 25) & ((df['Embarked'] == 'C') | (df['Embarked'] == 'Q'))]  # string values must be written in quotes
selected_rows


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
370,371,1,1,"Harder, Mr. George Achilles",male,25.0,1,0,11765,55.4417,E50,C
484,485,1,1,"Bishop, Mr. Dickinson H",male,25.0,1,0,11967,91.0792,B49,C
685,686,0,2,"Laroche, Mr. Joseph Philippe Lemercier",male,25.0,1,2,SC/Paris 2123,41.5792,,C
693,694,0,3,"Saad, Mr. Khalil",male,25.0,0,0,2672,7.225,,C
703,704,0,3,"Gallagher, Mr. Martin",male,25.0,0,0,36864,7.7417,,Q


In [75]:
# Filter rows where 'Age' is 25 or 23
selected_rows = df[(df['Age'] == 25) | (df['Age'] == 23)]
selected_rows


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
75,76,0,3,"Moen, Mr. Sigurd Hansen",male,25.0,0,0,348123,7.65,F G73,S
88,89,1,1,"Fortune, Miss. Mabel Helen",female,23.0,3,2,19950,263.0,C23 C25 C27,S
97,98,1,1,"Greenfield, Mr. William Bertram",male,23.0,0,1,PC 17759,63.3583,D10 D12,C
134,135,0,2,"Sobey, Mr. Samuel James Hayden",male,25.0,0,0,C.A. 29178,13.0,,S
135,136,0,2,"Richard, Mr. Emile",male,23.0,0,0,SC/PARIS 2133,15.0458,,C
246,247,0,3,"Lindahl, Miss. Agda Thorilda Viktoria",female,25.0,0,0,347071,7.775,,S
267,268,1,3,"Persson, Mr. Ernst Ulrik",male,25.0,1,0,347083,7.775,,S
271,272,1,3,"Tornquist, Mr. William Henry",male,25.0,0,0,LINE,0.0,,S
343,344,0,2,"Sedgwick, Mr. Charles Frederick Waddington",male,25.0,0,0,244361,13.0,,S
350,351,0,3,"Odahl, Mr. Nils Martin",male,23.0,0,0,7267,9.225,,S


- Any comparison operator can be used: >= , <= , < , > , != , ==

In [76]:
# Filter rows where 'Age' is less than or equal to 25
selected_rows = df[(df['Age'] <= 25)]  
selected_rows


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
877,878,0,3,"Petroff, Mr. Nedelio",male,19.0,0,0,349212,7.8958,,S
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S


In [77]:
# Filter rows where 'Age' is greater than or equal to 25
selected_rows = df[(df['Age'] >= 25)]  
selected_rows


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [79]:
# Filter rows where 'Age' is not 25
selected_rows = df[(df['Age'] != 25)]  # != means not equal; rows with a missing Age are kept too, because NaN != 25 is True
selected_rows


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [88]:
# Alternatively, negate the condition with ~
# Filter rows where 'Age' is not equal to 25
not_age_25 = df[~(df['Age'] == 25)]
not_age_25


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
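Both forms keep rows whose Age is missing, because a comparison with NaN is never "equal". If those rows should be excluded as well, one option is to add `notna()` to the condition. A minimal sketch on a made-up column:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
demo = pd.DataFrame({"Age": [25.0, 30.0, np.nan]})

not_25 = demo[demo["Age"] != 25]                                # keeps the NaN row
not_25_known = demo[(demo["Age"] != 25) & demo["Age"].notna()]  # drops the NaN row

print(not_25)
print(not_25_known)
```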


In [80]:
# Filter rows where 'Age' is less than 25
selected_rows = df[(df['Age'] < 25)]  
selected_rows


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,0,0,A/5. 2151,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
876,877,0,3,"Gustafsson, Mr. Alfred Ossian",male,20.0,0,0,7534,9.8458,,S
877,878,0,3,"Petroff, Mr. Nedelio",male,19.0,0,0,349212,7.8958,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
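Range conditions built from two comparisons can be written more compactly. The sketch below shows two alternatives that pandas provides, `Series.between()` and `DataFrame.query()`, on a made-up frame:

```python
import pandas as pd

# Hypothetical miniature frame for illustration.
demo = pd.DataFrame({
    "Age": [22.0, 25.0, 38.0, 23.0],
    "Pclass": [3, 3, 1, 2],
})

young_adults = demo[demo["Age"].between(23, 25)]          # 23 <= Age <= 25, both ends included
same_rows = demo.query("23 <= Age <= 25")                 # the same filter as an expression string
third_class_25 = demo.query("Age == 25 and Pclass == 3")  # conditions combined with 'and'

print(young_adults)
print(third_class_25)
```

`query()` avoids the nested parentheses that `&` and `|` require, at the cost of writing the condition as a string.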


    III. Filter Rows with Specific Values Using .isin():
    You can filter rows where a column's values match a list of specific values using the .isin() method.

In [82]:
# Filter rows where 'Pclass' is either 1 or 2
specific_pclass = df[df['Pclass'].isin([1, 2])]
specific_pclass

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,0,0,C.A./SOTON 34068,10.5000,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


    IV. Filter Rows Based on String Matching: .str.contains()
    Use .str.contains() to filter rows based on partial (substring) string matching.

In [84]:
# Filter rows where 'Sex' contains 'female'
female_data = df[df['Sex'].str.contains('female')]
female_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
...,...,...,...,...,...,...,...,...,...,...,...,...
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
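Because `.str.contains()` matches substrings, searching for 'male' would also match 'female'. When an exact value is wanted, a plain `==` comparison or an anchored regular expression is safer. A small sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical values, including one missing entry.
sex = pd.Series(["male", "female", None, "female"], name="Sex")

substring_match = sex.str.contains("male", na=False)  # also True for "female"
exact_match = sex == "male"                           # True only for "male"
anchored = sex.str.contains(r"^male$", na=False)      # regex anchored to the whole string

print(substring_match.tolist())
print(exact_match.tolist())
```

`na=False` treats missing values as non-matches, so the result can be used directly as a filter.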


    V. Filter Rows with Null (NaN) Values: .isna() or .isnull()
    You can filter rows that have null values in a specific column

In [85]:
# Filter rows where 'Age' is missing (null)
missing_age = df[df['Age'].isna()]
missing_age

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


    VI. Filter Rows Based on Index:
    You can also select rows by their index labels using .loc[] (use .iloc[] for integer positions).

In [90]:
# Select the rows with index labels 0 and 2
selected_rows = df.loc[[0, 2]]
selected_rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
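Labels and positions coincide in a freshly loaded DataFrame, but not after filtering, because filtered rows keep their original labels. The following sketch, on a made-up column, contrasts `.loc[]` with `.iloc[]`:

```python
import pandas as pd

# Hypothetical ages for illustration.
demo = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0]})
older = demo[demo["Age"] > 25]    # keeps the original labels 1, 2, 3

by_label = older.loc[[1, 3]]      # rows whose index labels are 1 and 3
by_position = older.iloc[[0, 2]]  # first and third rows of the filtered frame

print(by_label)
print(by_position)
```

Both selections return the same two rows here, while `older.loc[[0]]` would raise a KeyError because label 0 was filtered out.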


#### 5- Operations on string columns
The .str accessor provides methods to manipulate and extract information from string values in DataFrame columns.

    I. Uppercase and Lowercase
    Use .str.upper() and .str.lower() to change the case of strings in a column.

In [92]:
# Convert Sex to uppercase
uppercase_sex = df['Sex'].str.upper()
uppercase_sex


0        MALE
1      FEMALE
2      FEMALE
3      FEMALE
4        MALE
        ...  
886      MALE
887    FEMALE
888    FEMALE
889      MALE
890      MALE
Name: Sex, Length: 891, dtype: object

In [94]:
# Convert Embarked to lowercase
lowercase_Embarked = df['Embarked'].str.lower()
lowercase_Embarked


0      s
1      c
2      s
3      s
4      s
      ..
886    s
887    s
888    s
889    c
890    q
Name: Embarked, Length: 891, dtype: object

    II. Stripping Whitespace:
    Remove leading and trailing whitespace using .str.strip().

In [103]:
# Strip leading and trailing whitespace from Name (demonstrated on one value)
df.loc[0, "Name"] = "   hii    kk    "  # .loc avoids chained assignment and its SettingWithCopyWarning
remove_whitespace = df['Name'].str.strip()
remove_whitespace


0                                              hii    kk
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

    III.Replacing Substrings
    Replace a substring in each value using .str.replace().

In [106]:
# Replace 'hii' with 'hello'
replaced_text = df['Name'].str.replace('hii', 'hello')
replaced_text

0                                        hello    kk    
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
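
A small aside on the replacement above: .str.replace() can treat the pattern either as literal text or as a regular expression, and the default for `regex` has changed between pandas versions, so passing it explicitly is the safer habit. The sketch below uses a made-up Series (an assumption for illustration, not the Titanic data).

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the behaviour
names = pd.Series(["hii kk", "Mr. Smith", "hii hii"])

# Literal replacement: every occurrence of the exact text 'hii' is replaced
literal = names.str.replace("hii", "hello", regex=False)

# Regex replacement: '.' is a wildcard unless escaped, so escape it to match a real dot
no_dots = names.str.replace(r"\.", "", regex=True)

print(literal.tolist())  # ['hello kk', 'Mr. Smith', 'hello hello']
print(no_dots.tolist())  # ['hii kk', 'Mr Smith', 'hii hii']
```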

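    IV.String Splitting
    Split each string on a delimiter using .str.split(). With expand=True the pieces become separate columns; with expand=False each row holds a list.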

In [107]:
# Split Name into two columns at the first comma (n=1 limits the number of splits)
df.loc[0, "Name"] = "hello , ali"
print(df["Name"])
splitted_text = df['Name'].str.split(',', n=1, expand=True)
splitted_text

0                                            hello , ali
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object


Unnamed: 0,0,1
0,hello,ali
1,Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,Heikkinen,Miss. Laina
3,Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,Allen,Mr. William Henry
...,...,...
886,Montvila,Rev. Juozas
887,Graham,Miss. Margaret Edith
888,Johnston,"Miss. Catherine Helen ""Carrie"""
889,Behr,Mr. Karl Howell


In [108]:
# Split Name at the first comma with expand=False (each row becomes a list)
df.loc[0, "Name"] = "hello , ali"
print(df["Name"])
splitted_text = df['Name'].str.split(',', n=1, expand=False)
splitted_text

0                                            hello , ali
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object


0                                         [hello ,  ali]
1      [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                              [Heikkinen,  Miss. Laina]
3        [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                            [Allen,  Mr. William Henry]
                             ...                        
886                             [Montvila,  Rev. Juozas]
887                      [Graham,  Miss. Margaret Edith]
888          [Johnston,  Miss. Catherine Helen "Carrie"]
889                             [Behr,  Mr. Karl Howell]
890                               [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object
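
The list output above shows that each piece keeps the space that followed the comma. One way to clean that up, sketched here on a two-row made-up Series, is to strip every resulting column after splitting. The column names `last_name` and `rest` are hypothetical.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# n=1 limits the split to the first comma; expand=True returns a DataFrame
parts = names.str.split(",", n=1, expand=True)

# Each piece keeps the whitespace around the comma, so strip it column by column
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["last_name", "rest"]  # hypothetical column names

print(parts["last_name"].tolist())  # ['Braund', 'Heikkinen']
print(parts["rest"].tolist())       # ['Mr. Owen Harris', 'Miss. Laina']
```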

    V.String Concatenation:
    Combine string columns into a single column

In [112]:
splitted_name = df['Name'].str.split(',', n=1, expand=True)
splitted_name.rename(columns = {0:'first', 1:"second"}, inplace = True)
splitted_name


Unnamed: 0,first,second
0,hello,ali
1,Cumings,Mrs. John Bradley (Florence Briggs Thayer)
2,Heikkinen,Miss. Laina
3,Futrelle,Mrs. Jacques Heath (Lily May Peel)
4,Allen,Mr. William Henry
...,...,...
886,Montvila,Rev. Juozas
887,Graham,Miss. Margaret Edith
888,Johnston,"Miss. Catherine Helen ""Carrie"""
889,Behr,Mr. Karl Howell


In [113]:
# Concatenate 'first' and 'second' columns with ' , ' as the separator
df['Concatenated_name'] = splitted_name["first"] + " , " + splitted_name["second"]
df['Concatenated_name']


0                                          hello  ,  ali
1      Cumings ,  Mrs. John Bradley (Florence Briggs ...
2                               Heikkinen ,  Miss. Laina
3         Futrelle ,  Mrs. Jacques Heath (Lily May Peel)
4                             Allen ,  Mr. William Henry
                             ...                        
886                              Montvila ,  Rev. Juozas
887                       Graham ,  Miss. Margaret Edith
888           Johnston ,  Miss. Catherine Helen "Carrie"
889                              Behr ,  Mr. Karl Howell
890                                Dooley ,  Mr. Patrick
Name: Concatenated_name, Length: 891, dtype: object
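
Concatenating with `+` works well when both columns are complete. As a hedged sketch of what happens with missing values, the made-up Series below compare `+` with .str.cat(), whose `na_rep` argument substitutes a placeholder instead of producing a null.

```python
import pandas as pd

first = pd.Series(["Allen", "Behr", None])
second = pd.Series(["William", None, "Patrick"])

# '+' propagates missing values: a null on either side makes the whole result null
plus_result = first + " , " + second

# str.cat behaves the same by default, but na_rep fills in a placeholder instead
cat_result = first.str.cat(second, sep=" , ", na_rep="?")

print(plus_result.isna().tolist())  # [False, True, True]
print(cat_result.tolist())          # ['Allen , William', 'Behr , ?', '? , Patrick']
```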

    VI.String Length:
    Calculate the length of strings

In [114]:
# Calculate the length of names
df['Name_length'] = df['Name'].str.len()
df["Name_length"]

0      11
1      51
2      22
3      44
4      24
       ..
886    21
887    28
888    40
889    21
890    19
Name: Name_length, Length: 891, dtype: int64

    VII.String Matching and Filtering
    Filter rows whose values contain a specific substring using .str.contains().

In [117]:
# Filter rows where 'Name' contains 'Johnston'
Johnston_Name = df[df['Name'].str.contains('Johnston')]
Johnston_Name

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
783,784,0,3,"Johnston, Mr. Andrew G",male,,1,2,W./C. 6607,23.45,,S,"Johnston , Mr. Andrew G",22
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S,"Johnston , Miss. Catherine Helen ""Carrie""",40
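
The filter above is case-sensitive and assumes the column has no missing values. A minimal sketch of a more forgiving match, using a made-up Series that includes a null, could look like this:

```python
import pandas as pd

names = pd.Series(["Johnston, Mr. Andrew G", "Behr, Mr. Karl Howell", None])

# case=False ignores letter case; na=False treats missing names as "no match",
# which keeps the mask purely boolean so it can be used for filtering
mask = names.str.contains("johnston", case=False, na=False)

print(mask.tolist())         # [True, False, False]
print(names[mask].tolist())  # ['Johnston, Mr. Andrew G']
```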


    VIII.Counting Substrings:
    Count the occurrences of a substring

In [119]:
# Count occurrences of the letter 'f' in each Sex value
count_f = df['Sex'].str.count('f')
count_f

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    1
889    0
890    0
Name: Sex, Length: 891, dtype: int64

    IX.Checking Prefix and Suffix:
    Use .str.startswith() and .str.endswith() to check if strings start or end with a specific substring

In [121]:
# Filter rows where 'Name' starts with 'Behr' (the match is case-sensitive)
names = df[df['Name'].str.startswith('Behr')]
names



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C,"Behr , Mr. Karl Howell",21


In [125]:
# Filter rows where 'Name' ends with 'Andrew'
names = df[df['Name'].str.endswith('Andrew')]
names



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
757,758,0,2,"Bailey, Mr. Percy Andrew",male,18.0,0,0,29108,11.5,,S,"Bailey , Mr. Percy Andrew",24


#### 6.Indexing
The index holds the labels for the rows of a DataFrame or Series. It is used to select, align, and retrieve data by label. Index labels usually identify each row, although pandas does not require them to be unique.

In [140]:
df.set_index(['PassengerId'], inplace=True)
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
1,0,3,"hello , ali",male,22.0,1,0,A/5 21171,7.2500,,S,"hello , ali",11
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,"Heikkinen , Miss. Laina",22
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,"Allen , Mr. William Henry",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,"Montvila , Rev. Juozas",21
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,"Johnston , Miss. Catherine Helen ""Carrie""",40
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,"Behr , Mr. Karl Howell",21
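
As a complement to the in-place call above, the sketch below shows set_index() returning a new DataFrame, a label lookup through the new index, and reset_index() undoing the change. The three-row frame `people` is a hypothetical stand-in for the Titanic data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Titanic data
people = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Age": [22.0, 38.0, 26.0],
})

# Without inplace=True, set_index returns a new DataFrame and leaves 'people' unchanged
by_id = people.set_index("PassengerId")

print(by_id.index.is_unique)  # True: every label identifies exactly one row
print(by_id.loc[2, "Name"])   # Cumings (lookup by label, not by position)

# reset_index moves the index back into an ordinary column
restored = by_id.reset_index()
print(restored.columns.tolist())  # ['PassengerId', 'Name', 'Age']
```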


#### 7.Slicing

    I.Slicing Rows:
    You can slice rows by specifying a range of row positions; the stop position is excluded.
    The result is a new DataFrame.

In [127]:
# Slice rows at positions 1 to 3 (position 4 is excluded)
sliced_df = df[1:4]
sliced_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,"Heikkinen , Miss. Laina",22
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
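
To make the end-point rule concrete, here is a minimal sketch on a made-up five-row frame. A plain slice works on positions and excludes the stop, while a .loc[] slice works on labels and includes it.

```python
import pandas as pd

frame = pd.DataFrame({"value": [10, 20, 30, 40, 50]})  # default index 0..4

by_position = frame[1:4]   # positions 1, 2, 3 (stop is excluded)
by_label = frame.loc[1:3]  # labels 1, 2, 3 (stop is included)

print(by_position["value"].tolist())  # [20, 30, 40]
print(by_label["value"].tolist())     # [20, 30, 40]
print(by_position.equals(by_label))   # True
```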


    II.Slicing Columns:
    You can select specific columns by their names. 
    The result is a new DataFrame with the selected columns.

In [128]:
# Select 'Name' and 'Age' columns
selected_columns = df[['Name', 'Age']]
selected_columns

Unnamed: 0,Name,Age
0,"hello , ali",22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,"Heikkinen, Miss. Laina",26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,"Allen, Mr. William Henry",35.0
...,...,...
886,"Montvila, Rev. Juozas",27.0
887,"Graham, Miss. Margaret Edith",19.0
888,"Johnston, Miss. Catherine Helen ""Carrie""",
889,"Behr, Mr. Karl Howell",26.0


    III.Slicing Rows and Columns Simultaneously
    You can slice both rows and columns at once using .loc[] or .iloc[].

- Using .loc[] with labels

In [131]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
0,1,0,3,"hello , ali",male,22.0,1,0,A/5 21171,7.2500,,S,"hello , ali",11
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,"Heikkinen , Miss. Laina",22
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,"Allen , Mr. William Henry",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,"Montvila , Rev. Juozas",21
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,"Johnston , Miss. Catherine Helen ""Carrie""",40
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,"Behr , Mr. Karl Howell",21


In [137]:
# Select all rows and the 'Fare' and 'Age' columns
sliced_df = df.loc[:,['Fare', 'Age']]
sliced_df

Unnamed: 0,Fare,Age
0,7.2500,22.0
1,71.2833,38.0
2,7.9250,26.0
3,53.1000,35.0
4,8.0500,35.0
...,...,...
886,13.0000,27.0
887,30.0000,19.0
888,23.4500,
889,30.0000,26.0


In [139]:
# Select rows where 'Age' is greater than 30
selected_rows = df.loc[df['Age'] > 30]
selected_rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,"Allen , Mr. William Henry",24
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,"McCarthy , Mr. Timothy J",23
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S,"Bonnell , Miss. Elizabeth",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
873,874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,,S,"Vander Cruyssen , Mr. Victor",27
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C,"Potter , Mrs. Thomas Jr (Lily Alexenia Wilson)",45
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S,"Markun , Mr. Johann",18
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q,"Rice , Mrs. William (Margaret Norton)",36


In [142]:
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
1,0,3,"hello , ali",male,22.0,1,0,A/5 21171,7.2500,,S,"hello , ali",11
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,"Heikkinen , Miss. Laina",22
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,"Allen , Mr. William Henry",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,"Montvila , Rev. Juozas",21
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,"Johnston , Miss. Catherine Helen ""Carrie""",40
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,"Behr , Mr. Karl Howell",21


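- Using .iloc[] with integer positions

 .iloc[] is position-based: it selects rows and columns by their 0-based integer position, regardless of the index labels. (.loc[], by contrast, is label-based and also accepts boolean conditions.)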

In [144]:
# Select rows with integer indices 1 and 3
selected_rows = df.iloc[[1, 3]]
selected_rows

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44


In [145]:
# Select columns at integer indices 0 and 1 (Survived and Pclass)
selected_columns = df.iloc[:, [0, 1]]
selected_columns

PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
4,1,1
5,0,3
...,...,...
887,0,2
888,1,1
889,0,3
890,1,1


In [147]:
# Select rows at integer indices 1 and 3 and columns at integer indices 0 and 1
selected_data = df.iloc[[1, 3], [0, 1]]
selected_data

PassengerId,Survived,Pclass
2,1,1
4,1,1
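
The output above returns PassengerId 2 and 4 for positions 1 and 3, which is where .loc[] and .iloc[] diverge once the index no longer starts at 0. The following sketch reproduces that on a small made-up frame with the same index name.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Survived": [0, 1, 1, 1], "Pclass": [3, 1, 3, 1], "Age": [22.0, 38.0, 26.0, 35.0]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# .loc uses labels: PassengerId 1 and 3, columns by name
by_label = frame.loc[[1, 3], ["Survived", "Pclass"]]

# .iloc uses 0-based positions: the second and fourth rows, first two columns
by_position = frame.iloc[[1, 3], [0, 1]]

print(by_label.index.tolist())     # [1, 3]
print(by_position.index.tolist())  # [2, 4]
```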


#### 8.Deal with Null Values

    I.Drop Null values
    You can remove rows with null values using the dropna() method.

In [148]:
# Remove rows with null values
df_cleaned = df.dropna()
df_cleaned

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,"McCarthy , Mr. Timothy J",23
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S,"Sandstrom , Miss. Marguerite Rut",31
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S,"Bonnell , Miss. Elizabeth",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S,"Beckwith , Mrs. Richard Leonard (Sallie Monyp...",48
873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S,"Carlsson , Mr. Frans Olof",24
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C,"Potter , Mrs. Thomas Jr (Lily Alexenia Wilson)",45
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28


- Drop rows where the Age column is null

In [155]:
# Drop rows with null values in the 'Age' column
df_cleaned_age = df.dropna(subset=['Age'])
df_cleaned_age

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
1,0,3,"hello , ali",male,22.0,1,0,A/5 21171,7.2500,,S,"hello , ali",11
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,"Heikkinen , Miss. Laina",22
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,"Allen , Mr. William Henry",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q,"Rice , Mrs. William (Margaret Norton)",36
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,"Montvila , Rev. Juozas",21
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,"Behr , Mr. Karl Howell",21


- Drop rows where either the Age or the Cabin column is null

In [156]:
# Drop rows with null values in the 'Age' or 'Cabin' columns
df_cleaned_age_and_cabin = df.dropna(subset=['Age','Cabin'])
df_cleaned_age_and_cabin

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,"McCarthy , Mr. Timothy J",23
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S,"Sandstrom , Miss. Marguerite Rut",31
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S,"Bonnell , Miss. Elizabeth",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S,"Beckwith , Mrs. Richard Leonard (Sallie Monyp...",48
873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S,"Carlsson , Mr. Frans Olof",24
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C,"Potter , Mrs. Thomas Jr (Lily Alexenia Wilson)",45
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28



- Drop columns that contain any null value

In [157]:
# Remove columns with null values
df_no_null_columns = df.dropna(axis=1)
df_no_null_columns

PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Concatenated_name,Name_length
1,0,3,"hello , ali",male,1,0,A/5 21171,7.2500,"hello , ali",11
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.9250,"Heikkinen , Miss. Laina",22
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1000,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.0500,"Allen , Mr. William Henry",24
...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,0,0,211536,13.0000,"Montvila , Rev. Juozas",21
888,1,1,"Graham, Miss. Margaret Edith",female,0,0,112053,30.0000,"Graham , Miss. Margaret Edith",28
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,1,2,W./C. 6607,23.4500,"Johnston , Miss. Catherine Helen ""Carrie""",40
890,1,1,"Behr, Mr. Karl Howell",male,0,0,111369,30.0000,"Behr , Mr. Karl Howell",21


- Drop rows that contain any null value (axis=0 is the default, so this is equivalent to dropna() with no arguments)

In [159]:
# Remove rows with null values
df_no_null_rows = df.dropna(axis=0)
df_no_null_rows

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,"McCarthy , Mr. Timothy J",23
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S,"Sandstrom , Miss. Marguerite Rut",31
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S,"Bonnell , Miss. Elizabeth",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S,"Beckwith , Mrs. Richard Leonard (Sallie Monyp...",48
873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S,"Carlsson , Mr. Frans Olof",24
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C,"Potter , Mrs. Thomas Jr (Lily Alexenia Wilson)",45
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28
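
Before dropping anything, it can help to count the nulls per column and to consider a less aggressive rule than "any null". The sketch below, on a made-up three-row frame, uses the `how` and `thresh` arguments of dropna() for that purpose.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": ["C85", np.nan, "E46"],
    "Fare": [7.25, np.nan, 51.86],
})

# Count missing values per column (order: Age, Cabin, Fare)
print(frame.isna().sum().tolist())  # [2, 1, 1]

all_null_dropped = frame.dropna(how="all")  # drop a row only if every column is null
complete_only = frame.dropna(thresh=3)      # keep rows with at least 3 non-null values

print(all_null_dropped.index.tolist())  # [0, 2]
print(complete_only.index.tolist())     # [0]
```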


    II.Replace Null with value

- You can replace null values with a specific value using the fillna() method.

In [161]:
# Fill null values of Age Column with a specific value (e.g., 0)
df_filled_age = df['Age'].fillna(0, inplace=False)
df_filled_age

PassengerId
1      22.0
2      38.0
3      26.0
4      35.0
5      35.0
       ... 
887    27.0
888    19.0
889     0.0
890    26.0
891    32.0
Name: Age, Length: 891, dtype: float64

In [162]:
# Fill null values of the Age column with the column mean
df_filled_age = df['Age'].fillna(df["Age"].mean(), inplace=False)
df_filled_age

PassengerId
1      22.000000
2      38.000000
3      26.000000
4      35.000000
5      35.000000
         ...    
887    27.000000
888    19.000000
889    29.699118
890    26.000000
891    32.000000
Name: Age, Length: 891, dtype: float64

- Replace all null values in the DataFrame with the same value

In [163]:
# Fill null values with a specific value (e.g., 0)
df_filled = df.fillna(0)
df_filled

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
1,0,3,"hello , ali",male,22.0,1,0,A/5 21171,7.2500,0,S,"hello , ali",11
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,0,S,"Heikkinen , Miss. Laina",22
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,0,S,"Allen , Mr. William Henry",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,0,S,"Montvila , Rev. Juozas",21
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,0.0,1,2,W./C. 6607,23.4500,0,S,"Johnston , Miss. Catherine Helen ""Carrie""",40
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,"Behr , Mr. Karl Howell",21
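
Filling every column with 0 puts a numeric 0 into text columns such as Cabin, as the table above shows. A possible alternative, sketched here on a made-up frame, is to pass fillna() a dictionary with one fill value per column. The placeholder label "Unknown" is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Cabin": ["C85", np.nan, np.nan],
})

# One fill value per column: the mean for numeric Age, a placeholder label for Cabin
filled = frame.fillna({"Age": frame["Age"].mean(), "Cabin": "Unknown"})

print(filled["Age"].tolist())    # [20.0, 30.0, 40.0]
print(filled["Cabin"].tolist())  # ['C85', 'Unknown', 'Unknown']
```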


- You can fill null values with the previous valid value (forward fill, ffill) or the next valid value (backward fill, bfill) in each column.

In [164]:
# Forward fill null values (carry forward the previous value)
# df.ffill() replaces fillna(method='ffill'), which recent pandas versions deprecate
df_ffill = df.ffill()
df_ffill

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
1,0,3,"hello , ali",male,22.0,1,0,A/5 21171,7.2500,,S,"hello , ali",11
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,C85,S,"Heikkinen , Miss. Laina",22
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,C123,S,"Allen , Mr. William Henry",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,C50,S,"Montvila , Rev. Juozas",21
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,19.0,1,2,W./C. 6607,23.4500,B42,S,"Johnston , Miss. Catherine Helen ""Carrie""",40
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,"Behr , Mr. Karl Howell",21


In [165]:
# Backward fill null values (use the next valid value)
# df.bfill() replaces fillna(method='bfill'), which recent pandas versions deprecate
df_bfill = df.bfill()
df_bfill

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
1,0,3,"hello , ali",male,22.0,1,0,A/5 21171,7.2500,C85,S,"hello , ali",11
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,C123,S,"Heikkinen , Miss. Laina",22
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,E46,S,"Allen , Mr. William Henry",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,B42,S,"Montvila , Rev. Juozas",21
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,26.0,1,2,W./C. 6607,23.4500,C148,S,"Johnston , Miss. Catherine Helen ""Carrie""",40
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,"Behr , Mr. Karl Howell",21
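
Forward and backward fill are also available as the dedicated methods ffill() and bfill(). The sketch below applies them to a short made-up Series so the direction of each fill is easy to see, and adds the `limit` argument to cap how many consecutive nulls get filled.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, np.nan, 35.0])

forward = ages.ffill()         # copy the last valid value downward
backward = ages.bfill()        # copy the next valid value upward
limited = ages.ffill(limit=1)  # fill at most one consecutive null per gap

print(forward.tolist())         # [22.0, 22.0, 22.0, 35.0]
print(backward.tolist())        # [22.0, 35.0, 35.0, 35.0]
print(limited.isna().tolist())  # [False, False, True, False]
```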


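- Interpolate missing numeric values using methods such as linear, polynomial, or spline. This often gives a better estimate than a constant fill when neighboring rows are related, for example in ordered or time-series data.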


In [166]:
# Interpolate missing values linearly
df_interpolated = df.interpolate(method='linear')
df_interpolated

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length
1,0,3,"hello , ali",male,22.0,1,0,A/5 21171,7.2500,,S,"hello , ali",11
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,"Heikkinen , Miss. Laina",22
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,"Allen , Mr. William Henry",24
...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,"Montvila , Rev. Juozas",21
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,22.5,1,2,W./C. 6607,23.4500,,S,"Johnston , Miss. Catherine Helen ""Carrie""",40
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,"Behr , Mr. Karl Howell",21
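
To make the effect of linear interpolation easier to see, here is a minimal sketch on a toy Series (an assumption for illustration, not the Titanic data itself; the values mirror the `Age` cells of passengers 888 to 890 above). With `method='linear'`, pandas treats the rows as equally spaced and fills the gap halfway between its neighbours. Columns such as `Cabin` hold text, so there is nothing to interpolate and they stay missing. The `polynomial` and `spline` methods additionally need an `order` argument and SciPy installed, so they are left out here.

```python
import numpy as np
import pandas as pd

# Toy data: one missing age between two known ages
ages = pd.Series([19.0, np.nan, 26.0], index=[888, 889, 890], name="Age")

# Linear interpolation fills the gap with the midpoint of its neighbours
filled = ages.interpolate(method="linear")
print(filled)
```

The missing value becomes (19.0 + 26.0) / 2 = 22.5, which matches the `Age` shown for passenger 889 in the interpolated output.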


#### 9- Non-Numerical Columns

    Techniques for handling non-numeric (categorical) columns in a dataset.

    I. One-Hot Encoding
    One-hot encoding is a popular technique for encoding categorical variables.
    It creates one binary indicator column per unique category in the original column. Each indicator column is 1 when the row belongs to that category and 0 otherwise.

When to use one-hot encoding:
- When the categorical variable has no inherent order or hierarchy.
- When the number of unique categories is relatively small.
- When the presence or absence of each category is meaningful.

In [167]:
# One-hot encode the 'Embarked' column
encoded_df = pd.get_dummies(df, columns=['Embarked'])
encoded_df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Concatenated_name,Name_length,Embarked_C,Embarked_Q,Embarked_S
1,0,3,"hello , ali",male,22.0,1,0,A/5 21171,7.2500,,"hello , ali",11,0,0,1
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,"Cumings , Mrs. John Bradley (Florence Briggs ...",51,1,0,0
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,"Heikkinen , Miss. Laina",22,0,0,1
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44,0,0,1
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,"Allen , Mr. William Henry",24,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,"Montvila , Rev. Juozas",21,0,0,1
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,"Graham , Miss. Margaret Edith",28,0,0,1
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,"Johnston , Miss. Catherine Helen ""Carrie""",40,0,0,1
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,"Behr , Mr. Karl Howell",21,1,0,0
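
The same idea on a tiny, made-up DataFrame (the `ports` frame is an assumption for illustration) shows exactly which columns get created. Recent pandas versions return `True`/`False` indicator columns by default, so this sketch passes `dtype=int` to get the 0/1 values shown in the output above.

```python
import pandas as pd

# Toy column with the three embarkation ports
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(ports, columns=["Embarked"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across `Embarked_C`, `Embarked_Q`, and `Embarked_S`, and the original `Embarked` column is dropped.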


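    II. Label Encoding
    Label encoding assigns a numeric label to each unique category in the original column.
    Each category is mapped to a unique integer. scikit-learn's LabelEncoder numbers the categories in sorted (alphabetical) order, which is why C becomes 0 and S becomes 2 in the output of the next cell.

When to use label encoding:
- When the categorical variable has an inherent order or hierarchy and the sorted order of the labels matches it (otherwise prefer ordinal encoding, section III).
- When converting the categories to a numeric scale is meaningful for the analysis or model.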

In [170]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
df2 = df.copy(deep=True)  # deep copy, so changes to df2 do not affect the original df
df2['Embarked_new'] = label_encoder.fit_transform(df['Embarked'])
df2

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Concatenated_name,Name_length,Embarked_new
1,0,3,"hello , ali",male,22.0,1,0,A/5 21171,7.2500,,S,"hello , ali",11,2
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"Cumings , Mrs. John Bradley (Florence Briggs ...",51,0
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,"Heikkinen , Miss. Laina",22,2
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,"Futrelle , Mrs. Jacques Heath (Lily May Peel)",44,2
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,"Allen , Mr. William Henry",24,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,"Montvila , Rev. Juozas",21,2
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,"Graham , Miss. Margaret Edith",28,2
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,"Johnston , Miss. Catherine Helen ""Carrie""",40,2
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,"Behr , Mr. Karl Howell",21,0
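
To see which integer stands for which category, it helps to print the mapping. The sketch below is a pandas-only alternative, not the `LabelEncoder` call used above: `pd.factorize(..., sort=True)` also numbers the categories in sorted order, so for plain string values it should give the same kind of codes. The toy `ports` Series is an assumption for illustration.

```python
import pandas as pd

# Toy column with the three embarkation ports
ports = pd.Series(["S", "C", "S", "Q"])

# sort=True numbers the categories in sorted order: C -> 0, Q -> 1, S -> 2
codes, categories = pd.factorize(ports, sort=True)

print(codes)
print(dict(enumerate(categories)))
```

The numbers come from alphabetical order only, so a model that treats them as a scale would read S as "larger" than C even though the ports have no ranking.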


    III. Ordinal Encoding
    Ordinal encoding is similar to label encoding, but it assigns the numeric labels according to the order or hierarchy of the categories.
    The categories are mapped to integers based on that order, which can be defined manually or inferred from the data.

When to use ordinal encoding:
- When the categorical variable has an inherent order or hierarchy, and the order needs to be preserved in the encoded values.
- When converting the categories to a numeric scale with meaningful order is important for the analysis or model.

In [174]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Create a sample DataFrame
data = {'Name': ['John', 'Jane', 'Mike', 'Sarah'],
        'Education': ['High School', "Bachelor's", "Master's", "Bachelor's"]}
df = pd.DataFrame(data)
print("original dataframe\n" , df)
print("="*50)
# Define the order of categories for ordinal encoding
education_order = ['High School', "Bachelor's", "Master's"]

# Apply ordinal encoding to the 'Education' column
ordinal_encoder = OrdinalEncoder(categories=[education_order])
df_encoded = df.copy(deep=True)
df_encoded['Education'] = ordinal_encoder.fit_transform(df[['Education']])

# Display the modified DataFrame
print("encoded dataframe\n" ,df_encoded)

original dataframe
     Name    Education
0   John  High School
1   Jane   Bachelor's
2   Mike     Master's
3  Sarah   Bachelor's
encoded dataframe
     Name  Education
0   John        0.0
1   Jane        1.0
2   Mike        2.0
3  Sarah        1.0
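
`OrdinalEncoder` returns floats (0.0, 1.0, 2.0), as the output above shows. If integer codes are preferred, one possible pandas-only alternative is an ordered `pd.Categorical`; this is a sketch under the same `education_order` as above, and the `"PhD"` value is a hypothetical unseen category added for illustration.

```python
import pandas as pd

# Same explicit order as in the cell above: lowest level first
education_order = ["High School", "Bachelor's", "Master's"]
education = pd.Series(["High School", "Bachelor's", "Master's", "Bachelor's"])

# The position of each value in education_order becomes its integer code
codes = pd.Categorical(education, categories=education_order, ordered=True).codes
print(codes.tolist())

# A value that is not in education_order gets the code -1
unseen_code = pd.Categorical(["PhD"], categories=education_order, ordered=True).codes[0]
print(unseen_code)
```

The `-1` code is worth checking for before modelling, because it marks values that were missing from the defined order rather than a real level.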


    IV. Replace with numerical values (defined by you)
    Map each category to a number of your choice with a dictionary and the map() method.

In [176]:
import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Jane', 'Mike', 'Sarah'],
        'Education': ['High School', "Bachelor's", "Master's", "Bachelor's"]}
df = pd.DataFrame(data)
print("original dataframe\n" , df)
print("="*50)

df_encoded = df.copy(deep=True)
# Define a mapping for replacing education levels with numerical values
education_mapping = {'High School': 6, "Bachelor's": 7, "Master's": 8}

# Replace the 'Education' column with numerical values
df_encoded['Education'] = df['Education'].map(education_mapping)

# Display the modified DataFrame
print("encoded dataframe\n" ,df_encoded)

original dataframe
     Name    Education
0   John  High School
1   Jane   Bachelor's
2   Mike     Master's
3  Sarah   Bachelor's
encoded dataframe
     Name  Education
0   John          6
1   Jane          7
2   Mike          8
3  Sarah          7
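
One thing to watch with `map()` is that any category missing from the dictionary silently becomes `NaN`. A minimal sketch of how to spot such values, using the same mapping as above and a hypothetical `"PhD"` entry that the mapping does not cover:

```python
import pandas as pd

education_mapping = {"High School": 6, "Bachelor's": 7, "Master's": 8}

# "PhD" is a made-up value that has no entry in the mapping
education = pd.Series(["High School", "PhD", "Master's"])

# Values without a dictionary entry are mapped to NaN
mapped = education.map(education_mapping)

# List the original values that were left unmapped
unmapped = education[mapped.isna()].unique()
print(mapped)
print(unmapped)
```

Running a check like this after mapping makes it easier to extend the dictionary before the missing values reach a model.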
