# DataFrame
A DataFrame is like a table or a 2D spreadsheet that holds data in rows and columns. Think of it as a full Excel sheet where each column is a Series, and each row represents a data entry. It’s one of the most essential structures for working with data in Pandas.

<b>Tabular Structure:</b> A DataFrame stores data in a table format, where data is arranged in rows and columns. Each column in a DataFrame is a Series, and all columns together make up the DataFrame.

<b>Heterogeneous Data:</b> Unlike a Series, a DataFrame can hold different types of data across its columns. For example, one column can store numbers (integers or floats), while another can store text (strings), and a third can store dates.

<b>Labels (Indexes and Column Names): A DataFrame has two types of labels:</b>

<b>Row labels (index):</b> Similar to a Series, each row in a DataFrame has an index that helps locate and access data. It can be a number or a string.

<b>Column names:</b> Each column has a name (or label) that helps identify and access data in that column.

<b>Flexible Size:</b> A DataFrame has a dynamic size, meaning you can have any number of rows and columns. You can add or remove rows and columns as needed.

<b>Operations:</b> You can perform a wide range of operations on DataFrames:

<b>Arithmetic operations:</b> Perform mathematical calculations on columns.

<b>Filtering and sorting:</b> Select specific rows or columns based on conditions and sort them.

<b>Grouping and aggregation:</b> Group data by certain criteria and calculate summary statistics like averages, sums, etc.

<b>Merging and joining:</b> Combine multiple DataFrames based on common columns or indexes.

<b>Data Types:</b> Each column in a DataFrame can hold a specific data type (integers, floats, strings, dates, etc.), but different columns can have different types of data.

<b>Handling Missing Data:</b> DataFrames have built-in functionality to handle missing data (NaN values). You can fill in missing values, drop rows or columns with missing data, or handle it based on your analysis needs.

<b>Similar to Excel or SQL Tables:</b> A DataFrame is similar to an Excel sheet or a SQL table. You can think of it as a collection of Series (columns) with row and column labels.
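The properties listed above can be seen in a small sketch. The data and the names `people`, `age_in_months`, and `wider_shape` are made up for illustration; the example only shows mixed column types, custom row labels, and adding and removing a column.

```python
import pandas as pd

# Hypothetical data: three columns with three different data types.
people = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],
        "age": [31, 45, 27],
        "joined": pd.to_datetime(["2021-03-01", "2019-11-15", "2023-07-20"]),
    },
    index=["a", "b", "c"],  # string row labels instead of the default 0, 1, 2
)

print(people.dtypes)         # one data type per column
print(type(people["age"]))   # selecting a single column gives a Series

# Add a derived column, then remove it again; the shape changes accordingly.
people["age_in_months"] = people["age"] * 12
wider_shape = people.shape
people = people.drop(columns=["age_in_months"])
print(wider_shape, people.shape)
```

Each column keeps its own data type, while the row labels `a`, `b`, `c` are shared by all columns.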

In [1]:
import numpy as np
import pandas as pd

In [4]:
dataframe = pd.DataFrame([1,2,3,4,5,6,7,8,9,10])

print(dataframe,type(dataframe),id(dataframe))

    0
0   1
1   2
2   3
3   4
4   5
5   6
6   7
7   8
8   9
9  10 <class 'pandas.core.frame.DataFrame'> 2489130245984


In [7]:
dataframe = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],columns=["first","second","third"])

print(dataframe,type(dataframe),id(dataframe))

   first  second  third
0      1       2      3
1      4       5      6
2      7       8      9
3     10      11     12 <class 'pandas.core.frame.DataFrame'> 2489142740688


In [71]:
data = {"Student_Name": ["Akshay", "Jay", "Rohit"],
        "Maths": [78, 98, 86],
        "Science": [67, 78, 72],
        "English": [87, 93, 85]}

marks = pd.DataFrame(data)

marks

Unnamed: 0,Student_Name,Maths,Science,English
0,Akshay,78,67,87
1,Jay,98,78,93
2,Rohit,86,72,85


In [12]:
dataframe = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11]],columns=["first","second","third"])

print(dataframe,type(dataframe),id(dataframe))

   first  second  third
0      1       2    3.0
1      4       5    6.0
2      7       8    9.0
3     10      11    NaN <class 'pandas.core.frame.DataFrame'> 2489200117888
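The short last row above is padded with NaN, which is also why the `third` column is shown as floats. A minimal sketch of the usual ways to inspect and handle such gaps follows; the variable names are illustrative, and which strategy fits depends on the analysis.

```python
import pandas as pd

# Same ragged input as above: the last row has only two values.
frame = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11]],
    columns=["first", "second", "third"],
)

missing_per_column = frame.isna().sum()  # count NaN values in each column
dropped = frame.dropna()                 # drop every row that contains a NaN
filled = frame.fillna({"third": 0})      # replace NaN in "third" with 0

print(missing_per_column)
print(dropped)
print(filled)
```

Both `dropna()` and `fillna()` return new DataFrames here, so `frame` itself still contains the NaN.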


In [13]:
marks

Unnamed: 0,Student_Name,Maths,Science,English
0,Akshay,78,67,87
1,Jay,98,78,93
2,Rohit,86,72,85


In [14]:
marks["Science"]

0    67
1    78
2    72
Name: Science, dtype: int64

In [18]:
data = {"City":["Mumbai","Delhi","Banglore","Hyderabad"],
           "Temperature":[38,45,40,32]}
temp_df = pd.DataFrame(data)

In [24]:
data = {"City":["Mumbai","Delhi","Banglore","Chennai"],
           "Humidity":[68,65,75,70]}

humi_df = pd.DataFrame(data)

In [20]:
temp_df

Unnamed: 0,City,Temperature
0,Mumbai,38
1,Delhi,45
2,Banglore,40
3,Hyderabad,32


In [25]:
humi_df

Unnamed: 0,City,Humidity
0,Mumbai,68
1,Delhi,65
2,Banglore,75
3,Chennai,70


# concat() 

The concat() function in pandas combines two or more DataFrames into one, either vertically (stacking rows) or horizontally (adding columns), without changing the original DataFrames.

<b>Why use concat()</b>

It is used to combine separate DataFrames into one. Typical cases:

Combining monthly sales data into one DataFrame.

Merging customer information stored in different DataFrames.

<b>Syntax:</b>

pd.concat([df1, df2], axis=0)

[df1, df2]: List of DataFrames to combine.

axis=0: Stacks rows (vertical). Use axis=1 to add columns (horizontal).
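As a sketch of the monthly-sales case mentioned above, the following stacks two small tables. The tables `jan` and `feb` are hypothetical; `keys=` and `ignore_index=` are two alternative ways to deal with the repeated row labels.

```python
import pandas as pd

# Hypothetical monthly sales tables that share the same columns.
jan = pd.DataFrame({"product": ["A", "B"], "units": [10, 4]})
feb = pd.DataFrame({"product": ["A", "C"], "units": [7, 12]})

# Stack the rows; keys= labels each block so its origin stays visible.
by_month = pd.concat([jan, feb], keys=["jan", "feb"])

# Stack the rows and renumber the index 0..n-1 instead.
flat = pd.concat([jan, feb], ignore_index=True)

print(by_month)
print(flat)
```

With `keys=`, the result has a two-level row index, so `by_month.loc["feb"]` returns only the February rows.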

In [27]:
pd.concat([temp_df,humi_df])

Unnamed: 0,City,Temperature,Humidity
0,Mumbai,38.0,
1,Delhi,45.0,
2,Banglore,40.0,
3,Hyderabad,32.0,
0,Mumbai,,68.0
1,Delhi,,65.0
2,Banglore,,75.0
3,Chennai,,70.0


In [29]:
pd.concat([temp_df,humi_df],ignore_index = True)

Unnamed: 0,City,Temperature,Humidity
0,Mumbai,38.0,
1,Delhi,45.0,
2,Banglore,40.0,
3,Hyderabad,32.0,
4,Mumbai,,68.0
5,Delhi,,65.0
6,Banglore,,75.0
7,Chennai,,70.0


In [28]:
pd.concat([temp_df,humi_df],axis = 1)

Unnamed: 0,City,Temperature,City,Humidity
0,Mumbai,38,Mumbai,68
1,Delhi,45,Delhi,65
2,Banglore,40,Banglore,75
3,Hyderabad,32,Chennai,70


# merge()

The merge() function in pandas is used to combine two DataFrames based on common columns or indexes, similar to a SQL join.

<b>Why use merge()</b>

It is useful when you need to join related data from different DataFrames.

For example, merging customer details with purchase data on a shared column such as customer_id.

<b>Syntax:</b>

pd.merge(df1, df2, on='column_name', how='inner')

df1, df2: DataFrames to merge.

on='column_name': The common column for merging.

how='inner': Type of join ('inner', 'left', 'right', 'outer').
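The customer and purchase scenario above might look like the sketch below. Both tables are invented for illustration. The `indicator=True` option adds a `_merge` column, which can help when checking which side each row came from.

```python
import pandas as pd

# Hypothetical tables that share a customer_id column.
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]}
)
purchases = pd.DataFrame(
    {"customer_id": [1, 1, 3, 4], "amount": [250, 40, 90, 15]}
)

# Inner join: only customer_ids present in both tables survive.
inner = pd.merge(customers, purchases, on="customer_id", how="inner")

# Outer join with indicator=True adds a _merge column that records whether
# each row came from the left table only, the right table only, or both.
outer = pd.merge(customers, purchases, on="customer_id", how="outer", indicator=True)

print(inner)
print(outer)
```

Customer 1 appears twice in the result because that customer has two purchases; a merge repeats the matching row once per match.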

In [34]:
outer_join = pd.merge(temp_df,humi_df,on = "City",how ="outer")
outer_join

Unnamed: 0,City,Temperature,Humidity
0,Banglore,40.0,75.0
1,Chennai,,70.0
2,Delhi,45.0,65.0
3,Hyderabad,32.0,
4,Mumbai,38.0,68.0


In [31]:
pd.merge(temp_df,humi_df,on = "City",how ="inner")

Unnamed: 0,City,Temperature,Humidity
0,Mumbai,38,68
1,Delhi,45,65
2,Banglore,40,75


In [32]:
pd.merge(temp_df,humi_df,on = "City",how ="left")

Unnamed: 0,City,Temperature,Humidity
0,Mumbai,38,68.0
1,Delhi,45,65.0
2,Banglore,40,75.0
3,Hyderabad,32,


In [33]:
pd.merge(temp_df,humi_df,on = "City",how ="right")

Unnamed: 0,City,Temperature,Humidity
0,Mumbai,38.0,68
1,Delhi,45.0,65
2,Banglore,40.0,75
3,Chennai,,70


In [35]:
outer_join.to_csv("Outer_join_File.csv",index = False)

In [36]:
df = pd.read_csv("C:\\Users\\LENOVO\\Desktop\\Company_Dataset.csv")
df

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO
0,1,Tech Innovators,Technology,USA,120000000.0,500,2010-06-15,John Smith
1,2,Green Energy Corp,Energy,Germany,90000000.0,300,2008-09-01,Greta Thunberg
2,3,FinTech Solutions,Finance,UK,150000000.0,800,2012-04-22,Mark Johnson
3,4,HealthPlus,Healthcare,Canada,110000000.0,450,2015-01-10,Lisa Brown
4,5,AutoDrive,Automotive,Japan,200000000.0,1000,2000-11-30,Hiroshi Tanaka
5,6,Foodies Delight,Food & Beverage,France,80000000.0,250,2016-07-05,Marie Dubois
6,7,EduWorld,Education,Australia,75000000.0,200,2011-03-19,Richard Davis
7,8,StyleHub,Fashion,Italy,95000000.0,400,2009-10-25,Giorgio Armani
8,9,CleanTech,Environment,Sweden,60000000.0,150,2013-12-12,Annika Svensson
9,10,SafeNet,Security,USA,140000000.0,600,2014-05-07,Michael Miller


# head()

The head() function in pandas is used to quickly view the first few rows of a DataFrame. By default, it shows the first 5 rows.

<b>Why use head()</b>

It’s helpful when you want to take a quick look at the data without displaying the entire DataFrame, especially when it’s large.

<b>Syntax:</b>

df.head(n)

n: (Optional) The number of rows to display. Default is 5.
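A short sketch of the `n` argument, using a made-up ten-row frame called `numbers`. It also shows that a negative `n` is accepted.

```python
import pandas as pd

# Hypothetical frame with ten rows.
numbers = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

first_five = numbers.head()          # default: first 5 rows
first_three = numbers.head(3)        # first 3 rows
all_but_last_two = numbers.head(-2)  # negative n: everything except the last 2 rows

print(first_three)
print(all_but_last_two)
```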

In [37]:
df.head()

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO
0,1,Tech Innovators,Technology,USA,120000000.0,500,2010-06-15,John Smith
1,2,Green Energy Corp,Energy,Germany,90000000.0,300,2008-09-01,Greta Thunberg
2,3,FinTech Solutions,Finance,UK,150000000.0,800,2012-04-22,Mark Johnson
3,4,HealthPlus,Healthcare,Canada,110000000.0,450,2015-01-10,Lisa Brown
4,5,AutoDrive,Automotive,Japan,200000000.0,1000,2000-11-30,Hiroshi Tanaka


In [38]:
df.head(10)

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO
0,1,Tech Innovators,Technology,USA,120000000.0,500,2010-06-15,John Smith
1,2,Green Energy Corp,Energy,Germany,90000000.0,300,2008-09-01,Greta Thunberg
2,3,FinTech Solutions,Finance,UK,150000000.0,800,2012-04-22,Mark Johnson
3,4,HealthPlus,Healthcare,Canada,110000000.0,450,2015-01-10,Lisa Brown
4,5,AutoDrive,Automotive,Japan,200000000.0,1000,2000-11-30,Hiroshi Tanaka
5,6,Foodies Delight,Food & Beverage,France,80000000.0,250,2016-07-05,Marie Dubois
6,7,EduWorld,Education,Australia,75000000.0,200,2011-03-19,Richard Davis
7,8,StyleHub,Fashion,Italy,95000000.0,400,2009-10-25,Giorgio Armani
8,9,CleanTech,Environment,Sweden,60000000.0,150,2013-12-12,Annika Svensson
9,10,SafeNet,Security,USA,140000000.0,600,2014-05-07,Michael Miller


# tail()

The tail() function in pandas is used to view the last few rows of a DataFrame. By default, it shows the last 5 rows.

<b>Why use tail()</b>

It’s useful when you want to check the last entries of your data, especially in large datasets.

<b>Syntax:</b>
    
df.tail(n)

n: (Optional) The number of rows to display. Default is 5.
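One common use is checking the most recent entries of data that is stored in chronological order. The daily `log` table below is hypothetical.

```python
import pandas as pd

# Hypothetical daily log, already in chronological order.
log = pd.DataFrame(
    {
        "day": pd.date_range("2024-01-01", periods=7, freq="D"),
        "visits": [12, 15, 9, 20, 18, 22, 17],
    }
)

latest_three = log.tail(3)     # the three most recent days
skip_first_two = log.tail(-2)  # negative n: everything except the first 2 rows

print(latest_three)
```

The rows keep their original index labels (4, 5, 6 here), which makes it easy to see where they sit in the full table.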

In [39]:
df.tail()

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO
25,26,FashionForward,Fashion,Italy,95000000.0,410,2009-03-21,Sophia Loren
26,27,HomeStyle,Retail,USA,80000000.0,340,2016-06-11,Dorothy Parker
27,28,SmartAgro,Agriculture,Netherlands,90000000.0,300,2014-12-03,Jan De Vries
28,29,InnovateTech,Technology,China,170000000.0,780,2004-11-14,Li Wei
29,30,GlobalFinance,Finance,Switzerland,200000000.0,950,2001-07-09,Pierre Dubois


In [40]:
df.tail(10)

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO
20,21,AutoParts,Automotive,Germany,105000000.0,490,2008-05-29,Klaus Fischer
21,22,HealthCare Plus,Healthcare,UK,145000000.0,670,2002-04-23,Sarah Thompson
22,23,TasteTheWorld,Food & Beverage,USA,65000000.0,200,2015-09-27,Anthony Martin
23,24,CleanAir,Environment,Denmark,50000000.0,140,2017-08-16,Lars Jensen
24,25,CyberSecure,Security,Canada,135000000.0,620,2011-02-05,Jonathan Clark
25,26,FashionForward,Fashion,Italy,95000000.0,410,2009-03-21,Sophia Loren
26,27,HomeStyle,Retail,USA,80000000.0,340,2016-06-11,Dorothy Parker
27,28,SmartAgro,Agriculture,Netherlands,90000000.0,300,2014-12-03,Jan De Vries
28,29,InnovateTech,Technology,China,170000000.0,780,2004-11-14,Li Wei
29,30,GlobalFinance,Finance,Switzerland,200000000.0,950,2001-07-09,Pierre Dubois


# sample()

The sample() function in pandas is used to randomly select rows from a DataFrame. It lets you view a random subset of your data.

<b>Why use sample()</b>

It’s helpful when you want to inspect a random sample of your data, especially for large datasets.

<b>Syntax:</b>

df.sample(n)

n: (Optional) The number of random rows to return. Default is 1.
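Because the rows are chosen at random, two runs normally return different rows. The sketch below, on a made-up `population` frame, uses the `random_state` parameter to make a sample repeatable and `frac` to sample a proportion of the rows rather than a fixed count.

```python
import pandas as pd

# Hypothetical frame with 100 rows.
population = pd.DataFrame({"id": range(100)})

# random_state fixes the seed, so the same rows come back on every run.
picked = population.sample(n=5, random_state=42)
picked_again = population.sample(n=5, random_state=42)

# frac= samples a fraction of the rows instead of a fixed count.
tenth = population.sample(frac=0.1, random_state=0)

print(picked)
print(len(tenth))
```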

In [42]:
df.sample(10)

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO
28,29,InnovateTech,Technology,China,170000000.0,780,2004-11-14,Li Wei
24,25,CyberSecure,Security,Canada,135000000.0,620,2011-02-05,Jonathan Clark
18,19,Digital Solutions,Technology,India,130000000.0,580,2006-10-18,Rajesh Kumar
19,20,Oceanic Travels,Travel,Australia,95000000.0,330,2014-01-12,William Turner
10,11,BrightFuture,Education,UK,50000000.0,180,2018-08-14,Emma Wilson
20,21,AutoParts,Automotive,Germany,105000000.0,490,2008-05-29,Klaus Fischer
15,16,GameWorld,Entertainment,Japan,180000000.0,900,2005-01-15,Kenji Nakamura
14,15,ShopEase,Retail,USA,160000000.0,750,2007-07-01,Betty White
23,24,CleanAir,Environment,Denmark,50000000.0,140,2017-08-16,Lars Jensen
5,6,Foodies Delight,Food & Beverage,France,80000000.0,250,2016-07-05,Marie Dubois


# shape

The shape attribute in pandas is used to find the dimensions of a DataFrame. It tells you how many rows and columns the DataFrame has.

<b>Why use shape</b>

It’s useful when you need to know the size of your data quickly.

<b>Syntax:</b>

df.shape

Returns a tuple (rows, columns).

In [104]:
df.shape

(30, 9)

# size

The size attribute in pandas is used to get the total number of elements (data points) in a DataFrame. It equals the number of rows multiplied by the number of columns, and missing values are counted too.

<b>Why use size?</b>

It’s useful when you want to know how many total values are in your DataFrame.

<b>Syntax:</b>

df.size

Returns the total number of elements.
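The relationship between `shape`, `size`, and `len()` can be checked on a small made-up frame (`grid` is a hypothetical name):

```python
import pandas as pd

# Hypothetical 4 x 3 frame.
grid = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]})

n_rows, n_cols = grid.shape  # shape is a tuple, so it can be unpacked
print(n_rows, n_cols)        # 4 3
print(grid.size)             # 12, i.e. n_rows * n_cols
print(len(grid))             # 4: len() counts rows only
```

Note that `shape` and `size` are attributes, so they are written without parentheses.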

In [105]:
df.size

270

# df.columns

The columns attribute in pandas is used to get the column names of a DataFrame.

<b>Why use columns?</b>

It’s helpful when you need to see the names of all the columns in your DataFrame.

<b>Syntax:</b>

df.columns

Returns the column names as an Index object.
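Since the result is an Index object, it can be turned into a list, tested for membership, or cleaned with string methods. The frame `raw` and its untidy column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with inconsistent column names.
raw = pd.DataFrame({" Company Name": ["Acme"], "Revenue ": [1.5]})

names = raw.columns.tolist()             # Index -> plain Python list
has_revenue = "Revenue " in raw.columns  # membership test on the Index

# The Index supports vectorised string methods, handy for cleaning names.
cleaned = raw.copy()
cleaned.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")

print(names)
print(cleaned.columns.tolist())
```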

In [43]:
df.columns

Index(['CompanyID', 'CompanyName', 'Industry', 'Country', 'Revenue',
       'Employees', 'FoundingDate', 'CEO'],
      dtype='object')

# df.info()

The info() function in pandas prints a summary of your DataFrame, including the number of rows, the column names, each column's data type and non-null count, and the memory usage.

<b>Why use info()</b>

It’s helpful to quickly understand the structure of your data and check for missing values.

<b>Syntax:</b>

df.info()
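One detail that is easy to miss: `info()` prints its report and returns `None`. If the text is needed as a string, for example for logging, it can be captured through the `buf` parameter. The sketch below assumes a small made-up frame with one missing value.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value in "score".
scores = pd.DataFrame({"name": ["Asha", "Ben", "Chen"], "score": [88.0, None, 92.5]})

# info() writes its report to buf (stdout by default) and returns None.
buffer = io.StringIO()
result = scores.info(buf=buffer)
report = buffer.getvalue()

print(report)
```

In the report, a non-null count lower than the number of entries signals missing values in that column.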

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   CompanyID     30 non-null     int64  
 1   CompanyName   30 non-null     object 
 2   Industry      30 non-null     object 
 3   Country       30 non-null     object 
 4   Revenue       30 non-null     float64
 5   Employees     30 non-null     int64  
 6   FoundingDate  30 non-null     object 
 7   CEO           30 non-null     object 
dtypes: float64(1), int64(2), object(5)
memory usage: 2.0+ KB


In [48]:
df[["CompanyName","Employees"]].head()

Unnamed: 0,CompanyName,Employees
0,Tech Innovators,500
1,Green Energy Corp,300
2,FinTech Solutions,800
3,HealthPlus,450
4,AutoDrive,1000


# df.describe()

The describe() function in pandas gives statistical summaries of your DataFrame, such as count, mean, min, max, and percentiles for numeric columns.

<b>Why use describe()</b>

It helps you quickly understand the distribution and statistics of your data.

<b>Syntax:</b>

df.describe()
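By default only numeric columns are summarised. A brief sketch on a made-up `orders` table shows the `include` and `percentiles` parameters, which widen or adjust the summary:

```python
import pandas as pd

# Hypothetical mixed-type frame.
orders = pd.DataFrame(
    {
        "region": ["north", "south", "north", "east"],
        "amount": [10.0, 20.0, 30.0, 40.0],
    }
)

numeric_summary = orders.describe()            # numeric columns only
full_summary = orders.describe(include="all")  # text columns too: unique, top, freq
custom = orders["amount"].describe(percentiles=[0.1, 0.9])  # custom percentiles

print(numeric_summary)
print(full_summary)
print(custom)
```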

In [49]:
df.describe()

Unnamed: 0,CompanyID,Revenue,Employees
count,30.0,30.0,30.0
mean,15.5,115833300.0,498.333333
std,8.803408,45392600.0,270.211709
min,1.0,50000000.0,140.0
25%,8.25,81250000.0,300.0
50%,15.5,107500000.0,460.0
75%,22.75,143750000.0,657.5
max,30.0,210000000.0,1100.0


In [50]:
df.head()

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO
0,1,Tech Innovators,Technology,USA,120000000.0,500,2010-06-15,John Smith
1,2,Green Energy Corp,Energy,Germany,90000000.0,300,2008-09-01,Greta Thunberg
2,3,FinTech Solutions,Finance,UK,150000000.0,800,2012-04-22,Mark Johnson
3,4,HealthPlus,Healthcare,Canada,110000000.0,450,2015-01-10,Lisa Brown
4,5,AutoDrive,Automotive,Japan,200000000.0,1000,2000-11-30,Hiroshi Tanaka


# iloc[]

The iloc[] indexer in pandas is used for integer-location-based indexing. It allows you to select rows and columns by their integer positions, counting from 0. In a slice such as 2:8, the end position is excluded.

<b>Why use iloc[]</b>

It’s useful when you want to access specific rows and columns based on their position, regardless of the actual labels.

<b>Syntax:</b>

df.iloc[row_index, column_index]

row_index: Position of the row(s) to select.

column_index: Position of the column(s) to select.
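To make the "position, not label" point concrete, the sketch below uses a made-up frame whose row labels are letters, so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical frame whose row labels are strings, to show that
# iloc ignores labels and works purely by position.
letters = pd.DataFrame(
    {"upper": ["A", "B", "C", "D"], "code": [65, 66, 67, 68]},
    index=["w", "x", "y", "z"],
)

first_row = letters.iloc[0]  # Series for the row at position 0
middle = letters.iloc[1:3]   # positions 1 and 2; the end position 3 is excluded
cell = letters.iloc[2, 1]    # row position 2, column position 1
last_row = letters.iloc[-1]  # negative positions count from the end

print(middle)
print(cell)
```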

In [51]:
df.iloc[0]

CompanyID                     1
CompanyName     Tech Innovators
Industry             Technology
Country                     USA
Revenue             120000000.0
Employees                   500
FoundingDate         2010-06-15
CEO                  John Smith
Name: 0, dtype: object

In [52]:
df.iloc[2:8]

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO
2,3,FinTech Solutions,Finance,UK,150000000.0,800,2012-04-22,Mark Johnson
3,4,HealthPlus,Healthcare,Canada,110000000.0,450,2015-01-10,Lisa Brown
4,5,AutoDrive,Automotive,Japan,200000000.0,1000,2000-11-30,Hiroshi Tanaka
5,6,Foodies Delight,Food & Beverage,France,80000000.0,250,2016-07-05,Marie Dubois
6,7,EduWorld,Education,Australia,75000000.0,200,2011-03-19,Richard Davis
7,8,StyleHub,Fashion,Italy,95000000.0,400,2009-10-25,Giorgio Armani


In [53]:
df.iloc[2:8,4]

2    150000000.0
3    110000000.0
4    200000000.0
5     80000000.0
6     75000000.0
7     95000000.0
Name: Revenue, dtype: float64

In [54]:
df.iloc[2:6,2:5]

Unnamed: 0,Industry,Country,Revenue
2,Finance,UK,150000000.0
3,Healthcare,Canada,110000000.0
4,Automotive,Japan,200000000.0
5,Food & Beverage,France,80000000.0


In [57]:
df.iloc[2:6,-1:-4:-1] 

Unnamed: 0,CEO,FoundingDate,Employees
2,Mark Johnson,2012-04-22,800
3,Lisa Brown,2015-01-10,450
4,Hiroshi Tanaka,2000-11-30,1000
5,Marie Dubois,2016-07-05,250


# loc[]

The loc[] indexer in pandas is used for label-based indexing. It allows you to select rows and columns by their labels (names) instead of their integer positions. Unlike iloc[], a label slice such as 2:5 includes both endpoints.

<b>Why use loc[]</b>

It’s useful when you want to access specific rows and columns based on their labels, making your code more readable.

<b>Syntax:</b>

df.loc[row_label, column_label]

row_label: Label of the row(s) to select.

column_label: Label of the column(s) to select.
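A small sketch with string row labels, where label-based selection is easiest to recognise. The `weather` table is hypothetical. It also shows that `loc[]` accepts a boolean mask for the rows.

```python
import pandas as pd

# Hypothetical frame indexed by city name.
weather = pd.DataFrame(
    {"temp": [38, 45, 40, 32], "humidity": [68, 65, 75, 70]},
    index=["Mumbai", "Delhi", "Pune", "Chennai"],
)

one_cell = weather.loc["Delhi", "temp"]            # one row label, one column label
span = weather.loc["Delhi":"Pune", "temp"]         # label slices include both endpoints
hot = weather.loc[weather["temp"] > 39, ["temp"]]  # boolean mask for rows, list for columns

print(one_cell)
print(span)
print(hot)
```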

In [60]:
df.loc[2:5,"Employees"]

2     800
3     450
4    1000
5     250
Name: Employees, dtype: int64

In [61]:
df.loc[2:5,["CEO","Employees"]]

Unnamed: 0,CEO,Employees
2,Mark Johnson,800
3,Lisa Brown,450
4,Hiroshi Tanaka,1000
5,Marie Dubois,250


In [62]:
df.columns

Index(['CompanyID', 'CompanyName', 'Industry', 'Country', 'Revenue',
       'Employees', 'FoundingDate', 'CEO'],
      dtype='object')

In [63]:
df.loc[10:20,"Industry":"CEO"]

Unnamed: 0,Industry,Country,Revenue,Employees,FoundingDate,CEO
10,Education,UK,50000000.0,180,2018-08-14,Emma Wilson
11,Healthcare,USA,85000000.0,350,2010-02-18,Charles Moore
12,Travel,Spain,70000000.0,220,2012-09-30,Lucia Rodriguez
13,Construction,Germany,125000000.0,520,2011-11-22,Hans Muller
14,Retail,USA,160000000.0,750,2007-07-01,Betty White
15,Entertainment,Japan,180000000.0,900,2005-01-15,Kenji Nakamura
16,Agriculture,Brazil,115000000.0,470,2013-06-20,Pedro Silva
17,Aviation,USA,210000000.0,1100,1999-03-10,Linda Collins
18,Technology,India,130000000.0,580,2006-10-18,Rajesh Kumar
19,Travel,Australia,95000000.0,330,2014-01-12,William Turner
20,Automotive,Germany,105000000.0,490,2008-05-29,Klaus Fischer


# value_counts()

The value_counts() function in pandas counts how many times each unique value occurs in a Series. It returns a new Series with one count per unique value, sorted from most to least frequent by default.

<b>Why use value_counts()</b>

It’s helpful when you want to see the distribution of values in a column, such as how many times each category appears.

<b>Syntax:</b>

df['column_name'].value_counts()
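Two parameters are often useful alongside the plain call: `normalize` for proportions and `dropna` for including missing values. The Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
industry = pd.Series(["Tech", "Retail", "Tech", None, "Energy", "Tech"])

counts = industry.value_counts()                    # missing values are excluded by default
shares = industry.value_counts(normalize=True)      # proportions instead of counts
with_missing = industry.value_counts(dropna=False)  # also count the missing entries

print(counts)
print(shares)
print(with_missing)
```

The proportions are computed over the non-missing values, so "Tech" has a share of 3 out of 5 here.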

In [64]:
df["Industry"].value_counts()

Industry
Technology         3
Healthcare         3
Finance            2
Automotive         2
Food & Beverage    2
Education          2
Fashion            2
Environment        2
Security           2
Travel             2
Retail             2
Agriculture        2
Energy             1
Construction       1
Entertainment      1
Aviation           1
Name: count, dtype: int64

In [66]:
df["Industry"].unique()

array(['Technology', 'Energy', 'Finance', 'Healthcare', 'Automotive',
       'Food & Beverage', 'Education', 'Fashion', 'Environment',
       'Security', 'Travel', 'Construction', 'Retail', 'Entertainment',
       'Agriculture', 'Aviation'], dtype=object)

In [68]:
df["Industry"].nunique()

16

In [69]:
df.head()

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO
0,1,Tech Innovators,Technology,USA,120000000.0,500,2010-06-15,John Smith
1,2,Green Energy Corp,Energy,Germany,90000000.0,300,2008-09-01,Greta Thunberg
2,3,FinTech Solutions,Finance,UK,150000000.0,800,2012-04-22,Mark Johnson
3,4,HealthPlus,Healthcare,Canada,110000000.0,450,2015-01-10,Lisa Brown
4,5,AutoDrive,Automotive,Japan,200000000.0,1000,2000-11-30,Hiroshi Tanaka


In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   CompanyID     30 non-null     int64  
 1   CompanyName   30 non-null     object 
 2   Industry      30 non-null     object 
 3   Country       30 non-null     object 
 4   Revenue       30 non-null     float64
 5   Employees     30 non-null     int64  
 6   FoundingDate  30 non-null     object 
 7   CEO           30 non-null     object 
dtypes: float64(1), int64(2), object(5)
memory usage: 2.0+ KB


In [72]:
marks

Unnamed: 0,Student_Name,Maths,Science,English
0,Akshay,78,67,87
1,Jay,98,78,93
2,Rohit,86,72,85


In [74]:
marks["Average_Marks"] = (marks["Maths"]+marks["Science"]+marks["English"])/3

In [75]:
marks

Unnamed: 0,Student_Name,Maths,Science,English,Average_Marks
0,Akshay,78,67,87,77.333333
1,Jay,98,78,93,89.666667
2,Rohit,86,72,85,81.0


In [77]:
marks["Science"] = marks["Science"]-10

In [78]:
marks

Unnamed: 0,Student_Name,Maths,Science,English,Average_Marks
0,Akshay,78,57,87,77.333333
1,Jay,98,68,93,89.666667
2,Rohit,86,62,85,81.0


In [80]:
df.head()

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO
0,1,Tech Innovators,Technology,USA,120000000.0,500,2010-06-15,John Smith
1,2,Green Energy Corp,Energy,Germany,90000000.0,300,2008-09-01,Greta Thunberg
2,3,FinTech Solutions,Finance,UK,150000000.0,800,2012-04-22,Mark Johnson
3,4,HealthPlus,Healthcare,Canada,110000000.0,450,2015-01-10,Lisa Brown
4,5,AutoDrive,Automotive,Japan,200000000.0,1000,2000-11-30,Hiroshi Tanaka


In [81]:
df["FoundingDate"] = pd.to_datetime(df["FoundingDate"])

In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   CompanyID     30 non-null     int64         
 1   CompanyName   30 non-null     object        
 2   Industry      30 non-null     object        
 3   Country       30 non-null     object        
 4   Revenue       30 non-null     float64       
 5   Employees     30 non-null     int64         
 6   FoundingDate  30 non-null     datetime64[ns]
 7   CEO           30 non-null     object        
dtypes: datetime64[ns](1), float64(1), int64(2), object(4)
memory usage: 2.0+ KB


In [83]:
df["Founding_Date_Year"] = df["FoundingDate"].dt.year

In [84]:
df.head()

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO,Founding_Date_Year
0,1,Tech Innovators,Technology,USA,120000000.0,500,2010-06-15,John Smith,2010
1,2,Green Energy Corp,Energy,Germany,90000000.0,300,2008-09-01,Greta Thunberg,2008
2,3,FinTech Solutions,Finance,UK,150000000.0,800,2012-04-22,Mark Johnson,2012
3,4,HealthPlus,Healthcare,Canada,110000000.0,450,2015-01-10,Lisa Brown,2015
4,5,AutoDrive,Automotive,Japan,200000000.0,1000,2000-11-30,Hiroshi Tanaka,2000


In [87]:
df[df["Country"] == "USA"]

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO,Founding_Date_Year
0,1,Tech Innovators,Technology,USA,120000000.0,500,2010-06-15,John Smith,2010
9,10,SafeNet,Security,USA,140000000.0,600,2014-05-07,Michael Miller,2014
11,12,MedEquip,Healthcare,USA,85000000.0,350,2010-02-18,Charles Moore,2010
14,15,ShopEase,Retail,USA,160000000.0,750,2007-07-01,Betty White,2007
17,18,SkyHigh Airlines,Aviation,USA,210000000.0,1100,1999-03-10,Linda Collins,1999
22,23,TasteTheWorld,Food & Beverage,USA,65000000.0,200,2015-09-27,Anthony Martin,2015
26,27,HomeStyle,Retail,USA,80000000.0,340,2016-06-11,Dorothy Parker,2016


In [90]:
df[(df["Country"] == "USA") & (df["Employees"] > 600)]

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO,Founding_Date_Year
14,15,ShopEase,Retail,USA,160000000.0,750,2007-07-01,Betty White,2007
17,18,SkyHigh Airlines,Aviation,USA,210000000.0,1100,1999-03-10,Linda Collins,1999


In [91]:
df.head()

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO,Founding_Date_Year
0,1,Tech Innovators,Technology,USA,120000000.0,500,2010-06-15,John Smith,2010
1,2,Green Energy Corp,Energy,Germany,90000000.0,300,2008-09-01,Greta Thunberg,2008
2,3,FinTech Solutions,Finance,UK,150000000.0,800,2012-04-22,Mark Johnson,2012
3,4,HealthPlus,Healthcare,Canada,110000000.0,450,2015-01-10,Lisa Brown,2015
4,5,AutoDrive,Automotive,Japan,200000000.0,1000,2000-11-30,Hiroshi Tanaka,2000


In [95]:
df["Revenue"].sort_values(ascending = False)

17    210000000.0
29    200000000.0
4     200000000.0
15    180000000.0
28    170000000.0
14    160000000.0
2     150000000.0
21    145000000.0
9     140000000.0
24    135000000.0
18    130000000.0
13    125000000.0
0     120000000.0
16    115000000.0
3     110000000.0
20    105000000.0
25     95000000.0
19     95000000.0
7      95000000.0
1      90000000.0
27     90000000.0
11     85000000.0
5      80000000.0
26     80000000.0
6      75000000.0
12     70000000.0
22     65000000.0
8      60000000.0
23     50000000.0
10     50000000.0
Name: Revenue, dtype: float64

# sort_values()

The sort_values() function in pandas is used to sort a DataFrame by one or more columns in either ascending or descending order.

<b>Why use sort_values()</b>

It’s helpful when you want to organize your data based on specific criteria, such as sorting by sales figures or dates.

<b>Syntax:</b>

df.sort_values(by='column_name', ascending=True)

by: The column(s) to sort by.

ascending: Boolean value; True for ascending order, False for descending.
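Sorting by more than one column works by passing lists to `by` and `ascending`. The sketch below uses a made-up `firms` table.

```python
import pandas as pd

# Hypothetical company table.
firms = pd.DataFrame(
    {
        "country": ["USA", "Japan", "USA", "Japan"],
        "revenue": [120, 200, 210, 180],
    }
)

# Sort by country A-Z, then by revenue high-to-low within each country.
ordered = firms.sort_values(by=["country", "revenue"], ascending=[True, False])

# sort_values returns a new DataFrame; reset_index renumbers its rows.
ordered = ordered.reset_index(drop=True)

print(ordered)
```

The original `firms` frame keeps its order, because `sort_values()` does not modify it unless `inplace=True` is passed.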

In [97]:
df.sort_values(by = "Revenue",ascending = False).head()

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO,Founding_Date_Year
17,18,SkyHigh Airlines,Aviation,USA,210000000.0,1100,1999-03-10,Linda Collins,1999
29,30,GlobalFinance,Finance,Switzerland,200000000.0,950,2001-07-09,Pierre Dubois,2001
4,5,AutoDrive,Automotive,Japan,200000000.0,1000,2000-11-30,Hiroshi Tanaka,2000
15,16,GameWorld,Entertainment,Japan,180000000.0,900,2005-01-15,Kenji Nakamura,2005
28,29,InnovateTech,Technology,China,170000000.0,780,2004-11-14,Li Wei,2004


In [112]:
Total_Employees = df.groupby(["Country"])["Employees"].sum()
Total_Employees

Country
Australia       530
Brazil          470
Canada         1070
China           780
Denmark         140
France          250
Germany        1310
India           580
Italy           810
Japan          1900
Netherlands     300
Spain           220
Sweden          150
Switzerland     950
UK             1650
USA            3840
Name: Employees, dtype: int64

In [100]:
df[["Industry","Country"]].value_counts()

Industry         Country    
Retail           USA            2
Fashion          Italy          2
Agriculture      Brazil         1
Food & Beverage  France         1
Travel           Australia      1
Technology       USA            1
                 India          1
                 China          1
Security         USA            1
                 Canada         1
Healthcare       USA            1
                 UK             1
                 Canada         1
Food & Beverage  USA            1
Finance          UK             1
Agriculture      Netherlands    1
Finance          Switzerland    1
Environment      Sweden         1
                 Denmark        1
Entertainment    Japan          1
Energy           Germany        1
Education        UK             1
                 Australia      1
Construction     Germany        1
Aviation         USA            1
Automotive       Japan          1
                 Germany        1
Travel           Spain          1
Name: count, dtype:

In [102]:
Total_Revenue = df.groupby(["Industry","Country"])["Revenue"].sum()
Total_Revenue

Industry         Country    
Agriculture      Brazil         115000000.0
                 Netherlands     90000000.0
Automotive       Germany        105000000.0
                 Japan          200000000.0
Aviation         USA            210000000.0
Construction     Germany        125000000.0
Education        Australia       75000000.0
                 UK              50000000.0
Energy           Germany         90000000.0
Entertainment    Japan          180000000.0
Environment      Denmark         50000000.0
                 Sweden          60000000.0
Fashion          Italy          190000000.0
Finance          Switzerland    200000000.0
                 UK             150000000.0
Food & Beverage  France          80000000.0
                 USA             65000000.0
Healthcare       Canada         110000000.0
                 UK             145000000.0
                 USA             85000000.0
Retail           USA            240000000.0
Security         Canada         135000000.0
   

In [103]:
Total_Employees = df.groupby("Country")["Employees"].sum().reset_index()
Total_Employees

Unnamed: 0,Country,Employees
0,Australia,530
1,Brazil,470
2,Canada,1070
3,China,780
4,Denmark,140
5,France,250
6,Germany,1310
7,India,580
8,Italy,810
9,Japan,1900
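The groupby cells above compute one statistic at a time. As a possible extension, several statistics can be computed in a single pass with named aggregation. This is a sketch on a made-up `firms` table, and the output column names are illustrative.

```python
import pandas as pd

# Hypothetical company table.
firms = pd.DataFrame(
    {
        "country": ["USA", "Japan", "USA", "Japan", "UK"],
        "employees": [500, 1000, 600, 900, 800],
        "revenue": [120.0, 200.0, 140.0, 180.0, 150.0],
    }
)

# Named aggregation: each keyword becomes an output column,
# defined as (input column, aggregation function).
summary = (
    firms.groupby("country")
    .agg(
        total_employees=("employees", "sum"),
        mean_revenue=("revenue", "mean"),
        companies=("employees", "count"),
    )
    .reset_index()  # turn the group labels back into a regular column
)

print(summary)
```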


# duplicated()

The duplicated() function in pandas is used to identify duplicate rows in a DataFrame. It returns a boolean Series indicating whether each row is a duplicate of a previous row.

<b>Why use duplicated()</b>

It’s helpful when you want to check for and manage duplicate entries in your data.

<b>Syntax:</b>

df.duplicated()

Returns True for rows that repeat an earlier row and False otherwise. By default, the first occurrence of a repeated row is not marked as a duplicate.
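A brief sketch of the default behaviour and of the `keep` and `subset` parameters, on a made-up `visits` table in which one row appears twice:

```python
import pandas as pd

# Hypothetical table in which the "Asha" row appears twice.
visits = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Asha", "Chen"],
        "city": ["Pune", "Delhi", "Pune", "Delhi"],
    }
)

later_copies = visits.duplicated()              # first occurrence is False, repeats are True
every_copy = visits.duplicated(keep=False)      # mark all members of a duplicate group
same_city = visits.duplicated(subset=["city"])  # compare on selected columns only
deduplicated = visits.drop_duplicates()         # drop the rows flagged by duplicated()

print(later_copies.tolist())
print(deduplicated)
```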

In [108]:
df[df.duplicated()]

Unnamed: 0,CompanyID,CompanyName,Industry,Country,Revenue,Employees,FoundingDate,CEO,Founding_Date_Year


In [50]:
df = pd.read_csv("C:\\Users\\LENOVO\\Downloads\\titanic_dataset.csv")

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [4]:
df.size

5016

In [5]:
df.shape

(418, 12)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


In [7]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,0.363636,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.481622,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,0.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,0.0,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,0.0,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,1.0,3.0,39.0,1.0,0.0,31.5
max,1309.0,1.0,3.0,76.0,8.0,9.0,512.3292


In [8]:
df.describe(include = "object")

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,418,418,418,91,418
unique,418,2,363,76,3
top,"Kelly, Mr. James",male,PC 17608,B57 B59 B63 B66,S
freq,1,266,5,3,270


In [9]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [10]:
for i in df.describe(include = 'object').columns:
    print(i)
    print(df[i].unique())
   

Name
['Kelly, Mr. James' 'Wilkes, Mrs. James (Ellen Needs)'
 'Myles, Mr. Thomas Francis' 'Wirz, Mr. Albert'
 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)'
 'Svensson, Mr. Johan Cervin' 'Connolly, Miss. Kate'
 'Caldwell, Mr. Albert Francis'
 'Abrahim, Mrs. Joseph (Sophie Halaut Easu)' 'Davies, Mr. John Samuel'
 'Ilieff, Mr. Ylio' 'Jones, Mr. Charles Cresson'
 'Snyder, Mrs. John Pillsbury (Nelle Stevenson)' 'Howard, Mr. Benjamin'
 'Chaffee, Mrs. Herbert Fuller (Carrie Constance Toogood)'
 'del Carlo, Mrs. Sebastiano (Argenia Genovesi)' 'Keane, Mr. Daniel'
 'Assaf, Mr. Gerios' 'Ilmakangas, Miss. Ida Livija'
 'Assaf Khalil, Mrs. Mariana (Miriam")"' 'Rothschild, Mr. Martin'
 'Olsen, Master. Artur Karl' 'Flegenheim, Mrs. Alfred (Antoinette)'
 'Williams, Mr. Richard Norris II'
 'Ryerson, Mrs. Arthur Larned (Emily Maria Borie)'
 'Robins, Mr. Alexander A' 'Ostby, Miss. Helene Ragnhild'
 'Daher, Mr. Shedid' 'Brady, Mr. John Bertram' 'Samaan, Mr. Elias'
 'Louch, Mr. Charles Alexander' 'Jefferys,

In [45]:
df["Fare"].min()

0.0

In [46]:
df["Fare"].max()

512.3292

In [47]:
df["Fare"].median()

14.4542

In [48]:
df["Fare"].mode()

0    7.75
Name: Fare, dtype: float64

# var()

The var() function in pandas is used to calculate the variance of the data in a Series or DataFrame. Variance measures how far the data points lie from the mean, based on their squared deviations.

<b>Why use var()</b>

You use var() to understand the spread or variability of your data. A higher variance indicates that the data points are more spread out from the mean, while a lower variance indicates they are closer to the mean.

<b>Syntax:</b>

df.var(axis=0, ddof=1)

axis=0: Calculate variance for each column (default). Use axis=1 for rows.

ddof=1: Degrees of freedom, used to adjust the calculation (default is 1 for sample variance).
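The ddof parameter only changes the divisor. A minimal sketch on made-up numbers (the `values` Series is hypothetical) shows the difference between the sample variance (divide by n - 1) and the population variance (divide by n):

```python
import pandas as pd

# Hypothetical toy values; the mean is 4.0.
values = pd.Series([2.0, 4.0, 4.0, 6.0])

sample_var = values.var()            # ddof=1 (default): divides by n - 1
population_var = values.var(ddof=0)  # ddof=0: divides by n

# Manual check: squared deviations are 4, 0, 0, 4, which sum to 8.
squared_dev_sum = ((values - values.mean()) ** 2).sum()
manual_sample = squared_dev_sum / (len(values) - 1)
manual_population = squared_dev_sum / len(values)

print(sample_var, population_var)
```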

In [51]:
df["Fare"].var()

3125.6570743195775

# std()

The std() function in pandas is used to calculate the standard deviation of the data in a Series or DataFrame. Standard deviation measures how spread out the data points are around the mean.

<b>Why use std()</b>

You use std() to understand the dispersion or spread of your data. A higher standard deviation indicates that the data points are more spread out from the mean, while a lower standard deviation means they are closer to the mean.

<b>Syntax:</b>

df.std(axis=0, ddof=1)

axis=0: Calculate standard deviation for each column (default). Use axis=1 for rows.

ddof=1: Degrees of freedom, used to adjust the calculation (default is 1 for sample standard deviation).

In [52]:
df["Fare"].std()

55.90757617997383
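Standard deviation is the square root of variance, which is why the Fare standard deviation (about 55.91) squares to the Fare variance reported in the var() section (about 3125.66). A minimal sketch on made-up numbers, where the `values` Series is hypothetical:

```python
import math

import pandas as pd

# Hypothetical toy values.
values = pd.Series([2.0, 4.0, 4.0, 6.0])

sample_std = values.std()                # ddof=1 by default, like var()
root_of_var = math.sqrt(values.var())    # same quantity computed from the variance

print(sample_std, root_of_var)
```

Because it is expressed in the same units as the data, the standard deviation is usually easier to interpret than the variance.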

# df.cov()

The df.cov() function in pandas is used to calculate the covariance between columns in a DataFrame. Covariance indicates the direction of the relationship between two numerical variables—whether they increase or decrease together.

<b>Why use cov()</b>

You use cov() to understand how two variables move together. A positive covariance means that as one variable increases, the other tends to increase, while a negative covariance means that as one increases, the other tends to decrease.

<b>Syntax:</b>

df.cov(numeric_only=True)

numeric_only=True: This parameter ensures the covariance is calculated only for numeric columns, excluding non-numeric columns from the calculation.

In [53]:
df.cov(numeric_only = True)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,14595.166667,-1.352518,-2.720624,-59.369047,0.413669,5.107914,55.514238
Survived,-1.352518,0.23196,-0.044037,-8.8e-05,0.043165,0.075213,5.159417
Pclass,-2.720624,-0.044037,0.70869,-5.906358,0.00082,0.015467,-27.171232
Age,-59.369047,-8.8e-05,-5.906358,201.106695,-1.13527,-0.704115,291.83861
SibSp,0.413669,0.043165,0.00082,-1.13527,0.804178,0.2701,8.607981
Parch,5.107914,0.075213,0.015467,-0.704115,0.2701,0.963203,12.635175
Fare,55.514238,5.159417,-27.171232,291.83861,8.607981,12.635175,3125.657074


Covariance measures how much two variables change together, but it doesn't standardize the values like correlation does.

Fare and Age (291.8) have a positive covariance: older passengers tend to pay higher fares. The size of the number mainly reflects the large scale of Fare, so it does not by itself indicate a strong relationship.

PassengerId and Age (-59.4) have a negative covariance, but PassengerId is only an identifier and the value is small relative to the spread of both columns, so it carries little meaning.

Pclass and Fare (-27.2) have a negative covariance. Higher classes (lower Pclass numbers) are associated with higher fares.

SibSp and Parch (0.27) show a positive relationship, suggesting that passengers travelling with siblings or spouses also tend to have parents or children on board.

# df.corr()

The df.corr() function in pandas is used to calculate the correlation between columns in a DataFrame. Correlation measures the strength and direction of a linear relationship between two variables.

<b>Why use corr()</b>

You use corr() to understand how closely related two numerical variables are. A correlation value close to 1 means a strong positive relationship, close to -1 means a strong negative relationship, and close to 0 means no linear relationship.

<b>Syntax:</b>

df.corr(numeric_only=True)

numeric_only=True: This ensures the correlation is calculated only for numeric columns, excluding non-numeric columns.

In [54]:
df.corr(numeric_only = True)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.023245,-0.026751,-0.034102,0.003818,0.04308,0.008211
Survived,-0.023245,1.0,-0.108615,-1.3e-05,0.099943,0.15912,0.191514
Pclass,-0.026751,-0.108615,1.0,-0.492143,0.001087,0.018721,-0.577147
Age,-0.034102,-1.3e-05,-0.492143,1.0,-0.091587,-0.061249,0.337932
SibSp,0.003818,0.099943,0.001087,-0.091587,1.0,0.306895,0.171539
Parch,0.04308,0.15912,0.018721,-0.061249,0.306895,1.0,0.230046
Fare,0.008211,0.191514,-0.577147,0.337932,0.171539,0.230046,1.0


This is a correlation matrix showing the strength of relationships between numeric variables.

Pclass and Fare (-0.577): Strong negative correlation. Higher class (1st class, lower Pclass number) is associated with higher fares.

Age and Fare (0.338): Moderate positive correlation. Older passengers tend to pay more for their tickets.

SibSp and Parch (0.307): Positive correlation. Passengers travelling with siblings or spouses also tend to have parents or children aboard.

Survived and Fare (0.192): Weak positive correlation. Passengers who paid higher fares had a slightly better chance of survival.

Pclass and Age (-0.492): Moderate negative correlation. Passengers in higher classes (lower Pclass numbers) tend to be older.
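Correlation is covariance rescaled by the standard deviations of both columns, which is why it always falls between -1 and 1 while covariance depends on the units of the data. The sketch below uses a made-up table (the `toy` DataFrame and its column names are hypothetical) with no missing values, so the relationship can be checked directly:

```python
import pandas as pd

# Hypothetical toy data with two columns on very different scales.
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 58.0, 69.0, 80.0],
})

cov_xy = toy.cov().loc["height_cm", "weight_kg"]
corr_xy = toy.corr().loc["height_cm", "weight_kg"]

# Pearson correlation = covariance / (std of x * std of y).
manual_corr = cov_xy / (toy["height_cm"].std() * toy["weight_kg"].std())

# Changing the units rescales the covariance but not the correlation.
toy_m = toy.assign(height_m=toy["height_cm"] / 100)
cov_rescaled = toy_m.cov().loc["height_m", "weight_kg"]
corr_rescaled = toy_m.corr().loc["height_m", "weight_kg"]

print(cov_xy, cov_rescaled, corr_xy, corr_rescaled)
```

When columns contain NaN values, cov() and corr() work on the rows where both columns are present, so dividing a covariance from the matrix by column-wide standard deviations may not reproduce the correlation exactly.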

In [56]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,408,409,410,411,412,413,414,415,416,417
PassengerId,892,893,894,895,896,897,898,899,900,901,...,1300,1301,1302,1303,1304,1305,1306,1307,1308,1309
Survived,0,1,0,0,1,0,1,0,1,0,...,1,1,1,1,1,0,1,0,0,0
Pclass,3,3,2,3,3,3,3,2,3,3,...,3,3,3,1,3,3,1,3,3,3
Name,"Kelly, Mr. James","Wilkes, Mrs. James (Ellen Needs)","Myles, Mr. Thomas Francis","Wirz, Mr. Albert","Hirvonen, Mrs. Alexander (Helga E Lindqvist)","Svensson, Mr. Johan Cervin","Connolly, Miss. Kate","Caldwell, Mr. Albert Francis","Abrahim, Mrs. Joseph (Sophie Halaut Easu)","Davies, Mr. John Samuel",...,"Riordan, Miss. Johanna Hannah""""","Peacock, Miss. Treasteall","Naughton, Miss. Hannah","Minahan, Mrs. William Edward (Lillian E Thorpe)","Henriksson, Miss. Jenny Lovisa","Spector, Mr. Woolf","Oliva y Ocana, Dona. Fermina","Saether, Mr. Simon Sivertsen","Ware, Mr. Frederick","Peter, Master. Michael J"
Sex,male,female,male,male,female,male,female,male,female,male,...,female,female,female,female,female,male,female,male,male,male
Age,34.5,47.0,62.0,27.0,22.0,14.0,30.0,26.0,18.0,21.0,...,,3.0,,37.0,28.0,,39.0,38.5,,
SibSp,0,1,0,0,1,0,0,1,0,2,...,0,1,0,1,0,0,0,0,0,1
Parch,0,0,0,0,1,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,1
Ticket,330911,363272,240276,315154,3101298,7538,330972,248738,2657,A/4 48871,...,334915,SOTON/O.Q. 3101315,365237,19928,347086,A.5. 3236,PC 17758,SOTON/O.Q. 3101262,359309,2668
Fare,7.8292,7.0,9.6875,8.6625,12.2875,9.225,7.6292,29.0,7.2292,24.15,...,7.7208,13.775,7.75,90.0,7.775,8.05,108.9,7.25,8.05,22.3583


In [57]:
group = df.groupby('Sex')['Survived'].agg(['mean', 'sum', 'count'])  # Apply multiple aggregations
group


Unnamed: 0_level_0,mean,sum,count
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,1.0,152,152
male,0.0,0,266


In [11]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

# drop()

The df.drop() function in pandas is used to remove rows or columns from a DataFrame.

<b>Why use df.drop()</b>

You use drop() when you want to delete specific rows or columns that are no longer needed in your data.

<b>Syntax:</b>

df.drop(labels, axis=0, inplace=False)

labels: The row labels (index) or column names to be dropped.

axis=0: Drop rows (default). Use axis=1 to drop columns.

inplace=False: If set to True, it modifies the DataFrame in place without returning a new one.

df.drop('column_name', axis=1)

df.drop(0)  # drops the row with index 0
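The columns= and index= keywords are an alternative to labels plus axis, and they tend to read more clearly. A minimal sketch on a made-up table (the `toy` DataFrame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with string row labels.
toy = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

no_b = toy.drop(columns=["b"])   # same as toy.drop("b", axis=1)
no_x = toy.drop(index=["x"])     # same as toy.drop("x", axis=0)

# errors="ignore" skips labels that do not exist instead of raising KeyError.
safe = toy.drop(columns=["missing"], errors="ignore")

print(no_b.columns.tolist(), no_x.index.tolist())
```

None of these calls changes `toy` itself, because inplace defaults to False and a new DataFrame is returned.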

In [12]:
df.drop(columns = ["Cabin"],inplace = True)

In [13]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,S


In [14]:
df["Age"].mean()

30.272590361445783

In [15]:
df["Age"].median()

27.0

In [16]:
df["Age"].mode()

0    21.0
1    24.0
Name: Age, dtype: float64

# fillna()

The fillna() function in pandas is used to fill missing values (NaN) in a DataFrame or Series with a specified value.

<b>Why use fillna()</b>

You use fillna() to handle missing data by replacing NaN values with meaningful substitutes, like a fixed value or a method (like forward or backward fill).

<b>Syntax:</b>

df.fillna(value, method=None, inplace=False)

value: The value to replace NaN with.

method: Use 'ffill' (forward fill) or 'bfill' (backward fill) to fill NaN with the previous or next value. This parameter is deprecated since pandas 2.1; call df.ffill() or df.bfill() instead.

inplace=False: If True, modifies the DataFrame in place.

To fill NaN values with 0:

df.fillna(0)
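The main filling strategies can be compared side by side. This is a minimal sketch on a made-up Series (the name `s` is hypothetical); it uses the ffill() and bfill() methods for forward and backward filling:

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with two missing values in the middle.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_const = s.fillna(0)             # fixed value
filled_median = s.fillna(s.median())   # median of the known values (1 and 4) is 2.5
filled_forward = s.ffill()             # carry the previous value forward
filled_backward = s.bfill()            # pull the next value backward

print(filled_median.tolist())    # [1.0, 2.5, 2.5, 4.0]
print(filled_forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(filled_backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
```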

In [17]:
df["Age"] = df["Age"].fillna(df["Age"].median())

In [18]:
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

In [19]:
df.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [20]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,S


In [21]:
df["Age"] = df["Age"].astype(int)

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          418 non-null    int32  
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         418 non-null    float64
 10  Embarked     418 non-null    object 
dtypes: float64(1), int32(1), int64(5), object(4)
memory usage: 34.4+ KB


# rename()

The rename() function in pandas is used to change the names of columns or index labels in a DataFrame.

<b>Why use rename()</b>

You use rename() when you want to modify column names or row index labels for clarity or consistency in your DataFrame.

<b>Syntax:</b>

df.rename(columns={'old_name': 'new_name'}, inplace=False)

columns: A dictionary mapping old column names to new column names.

inplace=False: If True, makes changes directly to the DataFrame without returning a new one.

<b>Example:</b>

df.rename(columns={'old_column': 'new_column'})

To rename an index label:

df.rename(index={0: 'first_row'})
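Besides a dictionary, rename() also accepts a function that is applied to every label. A minimal sketch on a made-up table (the `toy` DataFrame and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with inconsistent column names.
toy = pd.DataFrame({"First Name": ["Ann"], "AGE": [30]})

renamed = toy.rename(columns={"First Name": "first_name", "AGE": "age"})
lowered = toy.rename(columns=str.lower)          # function applied to each column label
ignored = toy.rename(columns={"missing": "x"})   # keys that match nothing are skipped

print(renamed.columns.tolist())  # ['first_name', 'age']
print(lowered.columns.tolist())  # ['first name', 'age']
```

Because unmatched keys are skipped silently, a typo in an old column name produces no error; checking df.columns afterwards is a cheap safeguard.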

In [23]:
df.rename(columns = {"Pclass": "Passenger_Class","Parch":"Parent_Children","Embarked":"Port"},inplace = True)

In [24]:
df.head()

Unnamed: 0,PassengerId,Survived,Passenger_Class,Name,Sex,Age,SibSp,Parent_Children,Ticket,Fare,Port
0,892,0,3,"Kelly, Mr. James",male,34,0,0,330911,7.8292,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7.0,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,Q
3,895,0,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,S


In [25]:
def Age_Category(x):
    if x<18:
        return "Children"
    elif x<50:
        return "Adult"
    else:
        return "Old"

# apply()

The apply() function in pandas is used to apply a custom function to each element or along rows/columns of a DataFrame or Series.

<b>Why use apply()</b>

You use apply() when you want to perform custom operations on DataFrame columns or rows that go beyond built-in functions.

<b>Syntax:</b>

df.apply(func, axis=0)

func: The function to apply.

axis=0: Apply the function to each column (default). Use axis=1 to apply it to each row.
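The axis argument decides what the function receives: a whole column with axis=0, or a whole row with axis=1. On a single Series, apply() works value by value. A minimal sketch on a made-up table (the `toy` DataFrame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the function sees one column at a time.
col_range = toy.apply(lambda col: col.max() - col.min())

# axis=1: the function sees one row at a time.
row_total = toy.apply(lambda row: row["a"] + row["b"], axis=1)

# Series.apply: the function sees one value at a time.
labels = toy["a"].apply(lambda v: "big" if v > 1 else "small")

print(col_range.to_dict())  # {'a': 2, 'b': 20}
print(row_total.tolist())   # [11, 22, 33]
print(labels.tolist())      # ['small', 'big', 'big']
```

For simple arithmetic such as the row total, the vectorized form toy["a"] + toy["b"] gives the same result and is usually faster.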

In [27]:
df["Age_Category"] = df["Age"].apply(Age_Category)

In [28]:
df.head()

Unnamed: 0,PassengerId,Survived,Passenger_Class,Name,Sex,Age,SibSp,Parent_Children,Ticket,Fare,Port,Age_Category
0,892,0,3,"Kelly, Mr. James",male,34,0,0,330911,7.8292,Q,Adult
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7.0,S,Adult
2,894,0,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,Q,Old
3,895,0,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,S,Adult
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,S,Adult


In [29]:
Survived_By_Category = df.groupby(["Age_Category"])["Survived"].sum().sort_values(ascending = False).reset_index()
Survived_By_Category

Unnamed: 0,Age_Category,Survived
0,Adult,119
1,Children,17
2,Old,16


# pivot_table()

The pivot_table() function in pandas is used to summarize and aggregate data in a DataFrame by creating a new table with grouped values.

<b>Why use pivot_table()</b>

You use pivot_table() when you want to create a summary from your data, like calculating the mean, sum, or count of specific columns, similar to pivot tables in Excel.

<b>Syntax:</b>

df.pivot_table(values, index, columns, aggfunc='mean')

values: Column(s) to aggregate.

index: Column(s) to group by (rows).

columns: Column(s) to group by (columns).

aggfunc: The aggregation function (e.g., mean, sum, count).

<b>Example:</b>

To create a pivot table summarizing sales by region and product:

df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum')
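The sales example can be run on a small made-up table. This is a sketch only: the `sales` DataFrame and its values are hypothetical, and fill_value=0 is an extra option added here so that empty Region/Product combinations show 0 instead of NaN.

```python
import pandas as pd

# Hypothetical toy sales data.
sales = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "East"],
    "Product": ["A", "B", "A", "A", "A"],
    "Sales": [100, 150, 200, 50, 25],
})

summary = sales.pivot_table(
    values="Sales",
    index="Region",      # one row per region
    columns="Product",   # one column per product
    aggfunc="sum",
    fill_value=0,        # West sold no product B
)

print(summary)
```

Each cell is the sum of Sales for one Region and Product pair, for example 100 + 25 = 125 for East and A.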


In [30]:
pivot_table_df = df.pivot_table(values = "Fare",index = "Sex",columns = "Passenger_Class",aggfunc="mean")
pivot_table_df 

Passenger_Class,1,2,3
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,115.591168,26.43875,13.735129
male,75.586551,20.184654,11.844349


# rank()

The rank() function in pandas is used to assign ranks to values in a Series or DataFrame based on their order.

<b>Why use rank()</b>

You use rank() when you want to rank the values in a column or DataFrame, often in ascending or descending order.

<b>Syntax:</b>

df.rank(axis=0, method='average', ascending=True)

axis=0: Rank the values within each column (default). Use axis=1 to rank the values within each row.

method='average': The method to break ties (e.g., 'min', 'max', 'first').

ascending=True: Rank values in ascending order. Set to False for descending order.

<b>Example:</b>

To rank values in the 'Scores' column:

df['Rank'] = df['Scores'].rank()
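The method parameter matters only when values tie. With the default 'average', tied values share the mean of the positions they occupy, which is how two equal values at the bottom of a column both end up with rank 1.5. A minimal sketch on made-up scores (the `scores` Series is hypothetical):

```python
import pandas as pd

# Hypothetical toy scores with one tie.
scores = pd.Series([50, 70, 70, 90])

avg_rank = scores.rank()                  # ties share the mean of their positions
min_rank = scores.rank(method="min")      # ties share the lowest position
dense_rank = scores.rank(method="dense")  # like "min", but with no gap after a tie
top_first = scores.rank(ascending=False, method="min")  # rank 1 = highest score

print(avg_rank.tolist())    # [1.0, 2.5, 2.5, 4.0]
print(min_rank.tolist())    # [1.0, 2.0, 2.0, 4.0]
print(dense_rank.tolist())  # [1.0, 2.0, 2.0, 3.0]
print(top_first.tolist())   # [4.0, 2.0, 2.0, 1.0]
```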

In [69]:
df["Fare_Rank"] = df["Fare"].rank()

In [70]:
df.sort_values(by = "Fare_Rank",ascending = True).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Fare_Rank
266,1158,0,1,"Chisholm, Mr. Roderick Robert Crispin",male,,0,0,112051,0.0,,S,1.5
372,1264,0,1,"Ismay, Mr. Joseph Bruce",male,49.0,0,0,112058,0.0,B52 B54 B56,S,1.5
21,913,0,3,"Olsen, Master. Artur Karl",male,9.0,0,1,C 17368,3.1708,,S,3.0
133,1025,0,3,"Thomas, Mr. Charles P",male,,1,0,2621,6.4375,,C,4.5
116,1008,0,3,"Thomas, Mr. John",male,,0,0,2681,6.4375,,C,4.5


# map()

The map() function in pandas is used to map or replace values in a Series using a dictionary, function, or another Series.

<b>Why use map()</b>

You use map() when you want to transform values in a Series, like replacing values with new ones or applying a custom function to each value.

<b>Syntax:</b>

series.map(arg)

arg: A dictionary, function, or another Series to map values from.
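One detail worth knowing when mapping with a dictionary: any value without a matching key becomes NaN. A minimal sketch on a made-up Series (the `sex` Series and the "other" value are hypothetical):

```python
import pandas as pd

# Hypothetical toy values; "other" has no entry in the mapping below.
sex = pd.Series(["male", "female", "other"])

encoded = sex.map({"male": 1, "female": 0})  # unmatched value becomes NaN
upper = sex.map(str.upper)                   # a function is applied to each value

print(encoded.tolist())  # [1.0, 0.0, nan]
print(upper.tolist())    # ['MALE', 'FEMALE', 'OTHER']
```

The NaN also turns the result into a float column, so it is worth checking isna().sum() after mapping.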

In [33]:
df["Sex"] = df["Sex"].map({"male":1,"female":0})

In [34]:
df.head()

Unnamed: 0,PassengerId,Survived,Passenger_Class,Name,Sex,Age,SibSp,Parent_Children,Ticket,Fare,Port,Age_Category,Fare_Rank
0,892,0,3,"Kelly, Mr. James",1,34,0,0,330911,7.8292,Q,Adult,87.0
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",0,47,1,0,363272,7.0,S,Adult,8.5
2,894,0,2,"Myles, Mr. Thomas Francis",1,62,0,0,240276,9.6875,Q,Old,155.0
3,895,0,3,"Wirz, Mr. Albert",1,27,0,0,315154,8.6625,S,Adult,142.5
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,22,1,1,3101298,12.2875,S,Adult,172.0


In [35]:
df["Port"].value_counts()

Port
S    270
C    102
Q     46
Name: count, dtype: int64

In [36]:
Average_Fare_by_Port = df.groupby(["Port"])["Fare"].mean()
Average_Fare_by_Port 

Port
C    66.259765
Q    10.957700
S    28.179413
Name: Fare, dtype: float64

# replace()

The replace() function in pandas is used to substitute specific values in a DataFrame or Series with other values.

<b>Why use replace()</b>

You use replace() when you want to change certain values in your DataFrame or Series, such as correcting errors, standardizing values, or cleaning data.

<b>Syntax:</b>

df.replace(to_replace, value, inplace=False)

to_replace: The value(s) to be replaced (can be a list, dictionary, or string).

value: The new value(s) to substitute.

inplace=False: If True, modifies the DataFrame in place.

<b>Example:</b>

To replace a specific value:

df.replace('Unknown', 'Not Specified')
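replace() can act on the whole DataFrame or on a single column through a nested dictionary, and values that match nothing are left unchanged. A minimal sketch on a made-up table (the `toy` DataFrame, its column names and the port names used as replacements are hypothetical):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "port": ["C", "S", "Unknown"],
    "deck": ["Unknown", "B", "C"],
})

# Replace a value wherever it appears.
everywhere = toy.replace("Unknown", "Not Specified")

# Nested dict: {column: {old: new}} restricts the replacement to one column.
one_column = toy.replace({"port": {"C": "Cherbourg", "S": "Southampton"}})

print(everywhere)
print(one_column)
```

In `one_column`, the "C" in the deck column is untouched because the mapping is limited to the port column.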

In [37]:
df["Port"] = df["Port"].replace({"C": 1, "S": 2, "Q": 3})

Assign the result back to the column rather than calling df["Port"].replace(..., inplace=True). The chained in-place form raises a FutureWarning and will stop working in pandas 3.0, because the selected column behaves as a copy of the original data.


In [67]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


# df.query()

The df.query() function in pandas is used to filter rows of a DataFrame based on a query expression. It allows you to select rows by specifying conditions in a simple and readable way.

<b>Why use query()</b>

You use query() when you want to filter data based on specific conditions without having to use complex indexing or logical operators like & or |.

<b>Syntax:</b>

df.query('expression')

expression: A string that specifies the condition to filter rows by. You can reference column names directly.
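A query expression returns the same rows as the equivalent boolean mask. It can also read Python variables with @ and refer to column names containing spaces with backticks. A minimal sketch on a made-up table (the `toy` DataFrame, the "Home Port" column and the `min_fare` variable are hypothetical):

```python
import pandas as pd

# Hypothetical toy data; "Home Port" has a space in its name on purpose.
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 2],
    "Fare": [80.0, 30.0, 8.0, 60.0],
    "Home Port": ["S", "C", "S", "Q"],
})

min_fare = 50

by_query = toy.query("Pclass == 1 and Fare > @min_fare")        # @ reads a Python variable
by_mask = toy[(toy["Pclass"] == 1) & (toy["Fare"] > min_fare)]  # equivalent boolean mask
spaced = toy.query("`Home Port` == 'S'")                        # backticks quote the name

print(by_query.index.tolist())  # [0]
print(spaced.index.tolist())    # [0, 2]
```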

In [58]:
result = df.query('Pclass == 1 and Fare > 50')
result


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
12,904,1,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
14,906,1,1,"Chaffee, Mrs. Herbert Fuller (Carrie Constance...",female,47.0,1,0,W.E.P. 5734,61.1750,E31,S
20,912,0,1,"Rothschild, Mr. Martin",male,55.0,1,0,PC 17603,59.4000,,C
23,915,0,1,"Williams, Mr. Richard Norris II",male,21.0,0,1,PC 17597,61.3792,,C
24,916,1,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",female,48.0,1,3,PC 17608,262.3750,B57 B59 B63 B66,C
...,...,...,...,...,...,...,...,...,...,...,...,...
400,1292,1,1,"Bonnell, Miss. Caroline",female,30.0,0,0,36928,164.8667,C7,S
402,1294,1,1,"Gibson, Miss. Dorothy Winifred",female,22.0,0,1,112378,59.4000,,C
407,1299,0,1,"Widener, Mr. George Dunton",male,50.0,1,1,113503,211.5000,C80,C
411,1303,1,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0000,C78,Q


In [60]:
result = df.query('Sex == "female" and Age < 30 and SibSp > 0').head()
result


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
12,904,1,1,"Snyder, Mrs. John Pillsbury (Nelle Stevenson)",female,23.0,1,0,21228,82.2667,B45,S
15,907,1,2,"del Carlo, Mrs. Sebastiano (Argenia Genovesi)",female,24.0,1,0,SC/PARIS 2167,27.7208,,C
18,910,1,3,"Ilmakangas, Miss. Ida Livija",female,27.0,1,0,STON/O2. 3101270,7.925,,S
52,944,1,2,"Hocking, Miss. Ellen Nellie""""",female,20.0,2,1,29105,23.0,,S


# df.where()

The df.where() function in pandas is used to filter a DataFrame based on a condition, returning the original DataFrame where the condition is True and replacing the False values with NaN (or another specified value).

<b>Why use where()</b>

You use where() when you want to retain the original shape of the DataFrame while filtering based on a condition. This is useful when you want to keep the DataFrame structure intact but only display values that meet certain criteria.

<b>Syntax:</b>

df.where(cond, other=np.nan)

cond: A boolean condition that determines which values to keep.

other: The value to replace False entries with (default is NaN).
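The difference from ordinary boolean indexing is that where() keeps every row and blanks out the ones that fail the condition. Because NaN is a float, integer columns such as PassengerId are displayed as floats (892.0) in the result. A minimal sketch on a made-up Series (the `ages` Series is hypothetical):

```python
import pandas as pd

# Hypothetical toy ages.
ages = pd.Series([34.5, 47.0, 22.0, 27.0])

kept_shape = ages.where(ages > 30)           # same length; failing rows become NaN
with_other = ages.where(ages > 30, other=0)  # failing rows become 0 instead
filtered = ages[ages > 30]                   # boolean indexing drops failing rows

print(len(kept_shape), len(filtered))  # 4 2
print(with_other.tolist())             # [34.5, 47.0, 0.0, 0.0]
```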

In [62]:
result = df.where(df['Age'] > 30)
result


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892.0,0.0,3.0,"Kelly, Mr. James",male,34.5,0.0,0.0,330911,7.8292,,Q
1,893.0,1.0,3.0,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1.0,0.0,363272,7.0000,,S
2,894.0,0.0,2.0,"Myles, Mr. Thomas Francis",male,62.0,0.0,0.0,240276,9.6875,,Q
3,,,,,,,,,,,,
4,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
413,,,,,,,,,,,,
414,1306.0,1.0,1.0,"Oliva y Ocana, Dona. Fermina",female,39.0,0.0,0.0,PC 17758,108.9000,C105,C
415,1307.0,0.0,3.0,"Saether, Mr. Simon Sivertsen",male,38.5,0.0,0.0,SOTON/O.Q. 3101262,7.2500,,S
416,,,,,,,,,,,,


# clip()

The df.clip() function in pandas is used to limit the values in a DataFrame to a specified range. It effectively "clips" the values at the lower and upper thresholds you define.

<b>Why use clip()</b>

You use clip() when you want to ensure that your data falls within a certain range. This is useful for handling outliers by setting a maximum or minimum value.

<b>Syntax:</b>

df.clip(lower=None, upper=None, axis=None)

lower: The minimum value to clip at. Values below this will be set to this lower threshold.

upper: The maximum value to clip at. Values above this will be set to this upper threshold.

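axis: Only relevant when lower or upper is list-like; it sets the axis along which the thresholds are aligned. With scalar thresholds (the default axis=None), every value in the DataFrame is clipped.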
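Either bound can be left out to clip on one side only. A minimal sketch on made-up fares (the `fares` Series and the thresholds are hypothetical):

```python
import pandas as pd

# Hypothetical toy fares with one extreme value.
fares = pd.Series([0.0, 7.25, 30.0, 512.33])

both = fares.clip(lower=10, upper=50)  # values outside [10, 50] are moved to the nearest bound
upper_only = fares.clip(upper=100)     # cap large values, leave small ones alone

print(both.tolist())        # [10.0, 10.0, 30.0, 50.0]
print(upper_only.tolist())  # [0.0, 7.25, 30.0, 100.0]
```

Values already inside the range, such as 30.0, pass through unchanged.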

In [63]:
result = df['Fare'].clip(lower=10, upper=50)
result

0      10.0000
1      10.0000
2      10.0000
3      10.0000
4      12.2875
        ...   
413    10.0000
414    50.0000
415    10.0000
416    10.0000
417    22.3583
Name: Fare, Length: 418, dtype: float64

# df.shift()

The df.shift() function in pandas is used to shift the values in a DataFrame or Series by a specified number of periods. This function is often used in time series analysis to align data based on a time index.

<b>Why use shift()</b>

You use shift() when you want to create lagged or lead variables for analysis. For example, you might want to compare the current value of a time series with its previous value.

<b>Syntax:</b>

df.shift(periods=1, fill_value=None)

periods: The number of periods to shift. Positive values shift down (forward in time), while negative values shift up (backward in time).

fill_value: The value to use for the newly introduced missing positions. If it is not given, NaN is used for numeric data.
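A typical use of shift() is comparing each value with the one before it. A minimal sketch on made-up prices (the `prices` Series is hypothetical):

```python
import pandas as pd

# Hypothetical toy prices in time order.
prices = pd.Series([10.0, 12.0, 9.0, 15.0])

previous = prices.shift(1)                # lag: each row sees the prior value
following = prices.shift(-1)              # lead: each row sees the next value
change = prices - prices.shift(1)         # difference from the previous row
padded = prices.shift(1, fill_value=0.0)  # fill the gap with 0.0 instead of NaN

print(previous.tolist())  # [nan, 10.0, 12.0, 9.0]
print(change.tolist())    # [nan, 2.0, -3.0, 6.0]
print(padded.tolist())    # [0.0, 10.0, 12.0, 9.0]
```

The `change` column is the same result that prices.diff() returns.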

In [64]:
result = df['Fare'].shift(2)
result

0           NaN
1           NaN
2        7.8292
3        7.0000
4        9.6875
         ...   
413     90.0000
414      7.7750
415      8.0500
416    108.9000
417      7.2500
Name: Fare, Length: 418, dtype: float64

# df.rolling()

The df.rolling() function in pandas is used to create a rolling view of a DataFrame or Series, which allows you to perform calculations on a specified window of values. This is commonly used in time series analysis for calculating moving averages or other metrics over a set number of observations.

<b>Why use rolling()</b>

You use rolling() when you want to apply statistical functions over a specific number of consecutive data points, making it easier to analyze trends and patterns over time.

<b>Syntax:</b>

df.rolling(window, min_periods=None, center=False, win_type=None, on=None, closed=None)

window: The size of the moving window (number of periods to include in the calculation).

min_periods: The minimum number of observations required to have a value. The default is None, which for an integer window means the full window size, so the first window - 1 results are NaN.

center: If True, the labels are centered in the window.

win_type: Specify the type of window (e.g., 'boxcar', 'triang').

on: Column label to use for the rolling window when working with a DataFrame.

closed: Defines which side of the interval is closed.
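The effect of min_periods shows up at the start of the data, where a full window is not yet available. With an integer window of 3, the first two results are NaN unless min_periods is lowered. A minimal sketch on made-up numbers (the `fares` Series is hypothetical):

```python
import pandas as pd

# Hypothetical toy values.
fares = pd.Series([6.0, 9.0, 12.0, 3.0])

strict = fares.rolling(window=3).mean()                  # needs 3 values per window
lenient = fares.rolling(window=3, min_periods=1).mean()  # uses whatever is available

print(strict.tolist())   # [nan, nan, 9.0, 8.0]
print(lenient.tolist())  # [6.0, 7.5, 9.0, 8.0]
```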

In [65]:
result = df['Fare'].rolling(window=3).mean()
result

0            NaN
1            NaN
2       8.172233
3       8.450000
4      10.212500
         ...    
413    35.275000
414    41.575000
415    41.400000
416    41.400000
417    12.552767
Name: Fare, Length: 418, dtype: float64

In [66]:
result = df.melt(id_vars=['PassengerId', 'Survived'], value_vars=['Age', 'Fare'])
result

Unnamed: 0,PassengerId,Survived,variable,value
0,892,0,Age,34.5000
1,893,1,Age,47.0000
2,894,0,Age,62.0000
3,895,0,Age,27.0000
4,896,1,Age,22.0000
...,...,...,...,...
831,1305,0,Fare,8.0500
832,1306,1,Fare,108.9000
833,1307,0,Fare,7.2500
834,1308,0,Fare,8.0500


# get_dummies()

The get_dummies() function in pandas converts categorical variables into dummy (indicator) columns, a numeric format that can be provided to machine learning algorithms.

<b>Why use get_dummies()</b>

You use get_dummies() when you want to transform categorical data into numerical form for machine learning models. It creates a new column for each unique category with binary values (0 or 1).

<b>Syntax:</b>

pd.get_dummies(df, columns=None, drop_first=False)

df: The DataFrame or Series to convert.

columns: Specific columns to convert. If not specified, all columns with object, string, or category dtype are converted.

drop_first=False: If True, it drops the first dummy column to avoid multicollinearity (used in regression).
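Passing the whole DataFrame with columns= keeps the other columns and prefixes each new column with the original column name. A minimal sketch on a made-up table (the `toy` DataFrame and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({
    "port": ["S", "C", "Q", "S"],
    "fare": [7.0, 80.0, 8.0, 9.0],
})

full = pd.get_dummies(toy, columns=["port"], dtype=int)
reduced = pd.get_dummies(toy, columns=["port"], drop_first=True, dtype=int)

print(full.columns.tolist())     # ['fare', 'port_C', 'port_Q', 'port_S']
print(reduced.columns.tolist())  # ['fare', 'port_Q', 'port_S']
```

With drop_first=True the first category (port_C here) is dropped; a row with 0 in every remaining port column belongs to that category.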

In [39]:
dummy_data = pd.get_dummies(df["Port"],dtype = int)
dummy_data

Unnamed: 0,1,2,3
0,0,0,1
1,0,1,0
2,0,0,1
3,0,1,0
4,0,1,0
...,...,...,...
413,0,1,0
414,1,0,0
415,0,1,0
416,0,1,0


In [40]:
pd.concat([df,dummy_data],axis = 1)

Unnamed: 0,PassengerId,Survived,Passenger_Class,Name,Sex,Age,SibSp,Parent_Children,Ticket,Fare,Port,Age_Category,Fare_Rank,1,2,3
0,892,0,3,"Kelly, Mr. James",1,34,0,0,330911,7.8292,3,Adult,87.0,0,0,1
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",0,47,1,0,363272,7.0000,2,Adult,8.5,0,1,0
2,894,0,2,"Myles, Mr. Thomas Francis",1,62,0,0,240276,9.6875,3,Old,155.0,0,0,1
3,895,0,3,"Wirz, Mr. Albert",1,27,0,0,315154,8.6625,2,Adult,142.5,0,1,0
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,22,1,1,3101298,12.2875,2,Adult,172.0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,0,3,"Spector, Mr. Woolf",1,27,0,0,A.5. 3236,8.0500,2,Adult,128.0,0,1,0
414,1306,1,1,"Oliva y Ocana, Dona. Fermina",0,39,0,0,PC 17758,108.9000,1,Adult,389.0,1,0,0
415,1307,0,3,"Saether, Mr. Simon Sivertsen",1,38,0,0,SOTON/O.Q. 3101262,7.2500,2,Adult,32.0,0,1,0
416,1308,0,3,"Ware, Mr. Frederick",1,27,0,0,359309,8.0500,2,Adult,128.0,0,1,0


# nlargest() and nsmallest()

The nlargest() and nsmallest() functions in pandas are used to retrieve the top n largest or smallest values from a Series or DataFrame based on a specific column.

<b>Why use nlargest() and nsmallest()</b>

You use these functions when you want to find the highest or lowest values in your data, such as getting the top-performing employees or the lowest prices.

<b>Syntax for nlargest()</b>

df.nlargest(n, 'column_name')

n: The number of top values to retrieve.

column_name: The column to rank by.

<b>Syntax for nsmallest()</b>

df.nsmallest(n, 'column_name')

n: The number of smallest values to retrieve.

column_name: The column to rank by.

<b>Example:</b>

To get the top 3 highest salaries:

df.nlargest(3, 'Salary')

To get the 2 smallest ages:

df.nsmallest(2, 'Age')
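When several rows tie at the cut-off, the keep parameter decides which ones are returned. A minimal sketch on a made-up table (the `staff` DataFrame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a tie on the highest salary and the lowest age.
staff = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Salary": [50, 90, 90, 70],
    "Age": [41, 29, 35, 29],
})

top2 = staff.nlargest(2, "Salary")                 # two highest salaries
top_all = staff.nlargest(1, "Salary", keep="all")  # keep="all" returns every tied row
youngest = staff.nsmallest(2, "Age")               # two lowest ages

print(top2["Name"].tolist())      # ['Bob', 'Cy']
print(youngest["Name"].tolist())  # ['Bob', 'Di']
```

With keep="all", asking for 1 row returns both Bob and Cy because they share the top salary.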

In [41]:
nth_largest = df.nlargest(10,"Fare")
nth_largest

Unnamed: 0,PassengerId,Survived,Passenger_Class,Name,Sex,Age,SibSp,Parent_Children,Ticket,Fare,Port,Age_Category,Fare_Rank
343,1235,1,1,"Cardeza, Mrs. James Warburton Martinez (Charlo...",0,58,0,1,PC 17755,512.3292,1,Old,418.0
53,945,1,1,"Fortune, Miss. Ethel Flora",0,28,3,2,19950,263.0,2,Adult,416.5
69,961,1,1,"Fortune, Mrs. Mark (Mary McDougald)",0,60,1,4,19950,263.0,2,Old,416.5
24,916,1,1,"Ryerson, Mrs. Arthur Larned (Emily Maria Borie)",0,48,1,3,PC 17608,262.375,1,Adult,413.0
59,951,1,1,"Chaudanson, Miss. Victorine",0,36,0,0,PC 17608,262.375,1,Adult,413.0
64,956,0,1,"Ryerson, Master. John Borie",1,13,2,2,PC 17608,262.375,1,Children,413.0
142,1034,0,1,"Ryerson, Mr. Arthur Larned",1,61,1,3,PC 17608,262.375,1,Old,413.0
375,1267,1,1,"Bowen, Miss. Grace Scott",0,45,0,0,PC 17608,262.375,1,Adult,413.0
184,1076,1,1,"Douglas, Mrs. Frederick Charles (Mary Helene B...",0,27,1,1,PC 17558,247.5208,1,Adult,410.0
202,1094,0,1,"Astor, Col. John Jacob",1,47,1,0,PC 17757,227.525,1,Adult,409.0


In [42]:
nth_smallest = df.nsmallest(10,"Fare")
nth_smallest

Unnamed: 0,PassengerId,Survived,Passenger_Class,Name,Sex,Age,SibSp,Parent_Children,Ticket,Fare,Port,Age_Category,Fare_Rank
266,1158,0,1,"Chisholm, Mr. Roderick Robert Crispin",1,27,0,0,112051,0.0,2,Adult,1.5
372,1264,0,1,"Ismay, Mr. Joseph Bruce",1,49,0,0,112058,0.0,2,Adult,1.5
21,913,0,3,"Olsen, Master. Artur Karl",1,9,0,1,C 17368,3.1708,2,Children,3.0
116,1008,0,3,"Thomas, Mr. John",1,27,0,0,2681,6.4375,1,Adult,4.5
133,1025,0,3,"Thomas, Mr. Charles P",1,27,1,0,2621,6.4375,1,Adult,4.5
232,1124,0,3,"Wiklund, Mr. Karl Johan",1,21,1,0,3101266,6.4958,2,Adult,6.0
291,1183,1,3,"Daly, Miss. Margaret Marcella Maggie""""",0,30,0,0,382650,6.95,3,Adult,7.0
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",0,47,1,0,363272,7.0,2,Adult,8.5
163,1055,0,3,"Pearce, Mr. Ernest",1,27,0,0,343271,7.0,2,Adult,8.5
211,1103,0,3,"Finoli, Mr. Luigi",1,27,0,0,SOTON/O.Q. 3101308,7.05,2,Adult,10.5


In [43]:
Age_Distribution_Stats = df.groupby(["Age","Survived"])["Age"].mean()
Age_Distribution_Stats

Age  Survived
0    0            0.0
     1            0.0
1    1            1.0
2    0            2.0
     1            2.0
                 ... 
63   1           63.0
64   0           64.0
     1           64.0
67   0           67.0
76   1           76.0
Name: Age, Length: 101, dtype: float64

In [44]:
df.head()

Unnamed: 0,PassengerId,Survived,Passenger_Class,Name,Sex,Age,SibSp,Parent_Children,Ticket,Fare,Port,Age_Category,Fare_Rank
0,892,0,3,"Kelly, Mr. James",1,34,0,0,330911,7.8292,3,Adult,87.0
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",0,47,1,0,363272,7.0,2,Adult,8.5
2,894,0,2,"Myles, Mr. Thomas Francis",1,62,0,0,240276,9.6875,3,Old,155.0
3,895,0,3,"Wirz, Mr. Albert",1,27,0,0,315154,8.6625,2,Adult,142.5
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,22,1,1,3101298,12.2875,2,Adult,172.0
