# __Pandas DataFrame__

## __Agenda__

In this lesson, we will cover the following concepts with the help of examples:

- Introduction to Pandas DataFrame
  * Creating a DataFrame from Various Methods
  * Accessing the DataFrame
  * Understanding DataFrame Basics
- Introduction to Statistical Operations in Pandas
  * Descriptive Statistics
  * Mean, Median, and Standard Deviation
  * Correlation Analysis

## __1. Introduction to Pandas DataFrame__

A Pandas DataFrame is a two-dimensional, tabular data structure with labeled axes (rows and columns).

It is a primary data structure in the Pandas library, providing a versatile and efficient way to handle and manipulate data in Python.

![image.png](attachment:17113d99-3119-4b69-a615-e4d67afc3b60.png)

### __Key Features:__
- __Tabular structure:__ The DataFrame is organized as a table with rows and columns, similar to a spreadsheet or SQL table.

- __Labeled axes:__ Both rows and columns are labeled, allowing for easy indexing and referencing of data.

- __Heterogeneous data types:__ Each column has its own data type, so a single DataFrame can hold integers, floats, strings, or even arbitrary Python objects in different columns.

- __Versatility:__ DataFrames can be read from and written to a wide range of formats, including CSV, Excel, SQL databases, and more.

- __Data alignment:__ Operations between DataFrames align data on row and column labels; a label present in only one operand produces a missing value (NaN) rather than an error.
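
Label-based alignment is easiest to see with two small objects whose labels only partly overlap. The sketch below uses two hypothetical Series (`q1_sales` and `q2_sales` are illustrative names, not part of the lesson data); the same rule applies to DataFrames.

```python
import pandas as pd

# Two hypothetical Series whose row labels only partly overlap
q1_sales = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
q2_sales = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition matches values by label, not by position
total = q1_sales + q2_sales
print(total)

# Labels present on only one side produce NaN; fill_value treats the missing side as 0
total_filled = q1_sales.add(q2_sales, fill_value=0)
print(total_filled)
```

Labels `a` and `d` exist in only one Series, so the plain sum is NaN there, while `add(..., fill_value=0)` keeps the value that is present.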

### __1.1 Creating a DataFrame from Various Methods__
Creating a DataFrame is usually the first step in any analysis with Pandas.
- Pandas provides several ways to build a DataFrame, depending on where the data comes from and how it is structured.
- Python dictionaries, lists, NumPy arrays, and external files such as CSV and Excel can all be converted into the same tabular format.
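
Two further sources are common in practice: a list of dictionaries (one dictionary per row) and CSV text. The sketch below is a minimal illustration; the in-memory string `csv_text` is an assumption that stands in for a file on disk so that the example runs without any external data.

```python
import io
import pandas as pd

# A list of dictionaries: each dictionary becomes one row
records = [{'Name': 'Alice', 'Age': 25},
           {'Name': 'Bob', 'Age': 30}]
df_records = pd.DataFrame(records)

# CSV text held in memory stands in for a file on disk (illustrative data)
csv_text = "Name,Age\nAlice,25\nBob,30\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_records)
print(df_csv)
```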

In [2]:
import pandas as pd

# Creating a DataFrame from a dictionary
data_dict = {'Name': ['Alice', 'Bob', 'Charlie'],
             'Age': [25, 30, 22],
             'Salary': [50000, 60000, 45000]}

df_dict = pd.DataFrame(data_dict,index=['a','b','c'])
df_dict

      Name  Age  Salary
a    Alice   25   50000
b      Bob   30   60000
c  Charlie   22   45000


In [3]:
df_dict.index

Index(['a', 'b', 'c'], dtype='object')

In [4]:
df_dict.columns

Index(['Name', 'Age', 'Salary'], dtype='object')

In [5]:
df_dict.values

array([['Alice', 25, 50000],
       ['Bob', 30, 60000],
       ['Charlie', 22, 45000]], dtype=object)

In [6]:
df_dict.ndim

2

In [7]:
df_dict.shape

(3, 3)

In [8]:
df_dict

      Name  Age  Salary
a    Alice   25   50000
b      Bob   30   60000
c  Charlie   22   45000


In [9]:
# loc and iloc
df_dict.iloc[1,2]

60000

In [10]:
df_dict.loc['b','Name']

'Bob'

In [11]:
# slicing
df_dict.iloc[0:2]

    Name  Age  Salary
a  Alice   25   50000
b    Bob   30   60000


In [12]:
# slicing
df_dict.loc['a':'c']

      Name  Age  Salary
a    Alice   25   50000
b      Bob   30   60000
c  Charlie   22   45000
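
The two slices above behave differently at the end point: `iloc` follows Python slicing and excludes the stop position, while `loc` includes the stop label. A minimal sketch, using a hypothetical frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'Age': [25, 30, 22]}, index=['a', 'b', 'c'])

by_position = df_demo.iloc[0:2]   # positions 0 and 1; stop position 2 is excluded
by_label = df_demo.loc['a':'b']   # labels 'a' through 'b'; the stop label is included

print(by_position)
print(by_label)
```

Both slices return the same two rows here, even though one stop value is `2` and the other is `'b'`.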


In [13]:
import numpy as np
x = np.random.randint(0,100,size=(4,5))

In [14]:
df = pd.DataFrame(data=x, index=list('abcd'), columns=list('efghi'))
df

    e   f   g   h   i
a  16   5  52  69  50
b  42  58  13   8  10
c  65  89  51  60  48
d  61  16   1  12  87


In [15]:
import pandas as pd

# Creating a DataFrame from a dictionary
data_dict = {'Name': ['Alice', 'Bob', 'Charlie'],
             'Age': [25, 30, 22],
             'Salary': [50000, 60000, 45000]}

df_dict = pd.DataFrame(data_dict)
print(df_dict)

# Creating a DataFrame from lists
data_list = [['Alice', 25, 50000], ['Bob', 30, 60000], ['Charlie', 22, 45000]]

# Defining column names
columns = ['Name', 'Age', 'Salary']

df_list = pd.DataFrame(data_list, columns=columns)
print(df_list)

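# Creating a DataFrame from a NumPy array
# Note: a NumPy array holds a single dtype, so mixing text and numbers
# stores every value (including Age and Salary) as a string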
import numpy as np
data_array = np.array([['Alice', 25, 50000],
                       ['Bob', 30, 60000],
                       ['Charlie', 22, 45000]])

df_array = pd.DataFrame(data_array, columns=columns)
print(df_array)

# # Creating a DataFrame from a CSV file
# df_csv = pd.read_csv('HousePrices.csv')
# print(df_csv)

# # Creating a DataFrame from an Excel file
# df_excel = pd.read_excel('Iris.xlsx')
# print(df_excel)

      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   22   45000
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   22   45000
      Name Age Salary
0    Alice  25  50000
1      Bob  30  60000
2  Charlie  22  45000
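
In the third table above, Age and Salary are left-aligned because they are stored as strings. The sketch below shows one way to confirm this and two possible ways to recover numeric columns; the names `df_strings`, `df_objects`, and `df_converted` are illustrative.

```python
import numpy as np
import pandas as pd

columns = ['Name', 'Age', 'Salary']
rows = [['Alice', 25, 50000], ['Bob', 30, 60000]]

# Without dtype=object, NumPy stores every element as a string
df_strings = pd.DataFrame(np.array(rows), columns=columns)
print(df_strings.dtypes)

# Option 1: keep the original Python objects, then let pandas infer column types
df_objects = pd.DataFrame(np.array(rows, dtype=object), columns=columns).infer_objects()
print(df_objects.dtypes)

# Option 2: convert the string columns after the fact
df_converted = df_strings.astype({'Age': 'int64', 'Salary': 'int64'})
print(df_converted.dtypes)
```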


In [25]:
df2 = pd.read_csv("HousePrices.csv")
df2

Unnamed: 0,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,sqft_above,sqft_basement,yr_built,yr_renovated,street,city,statezip,country
0,2014-05-02 00:00:00,3.130000e+05,3.0,1.50,1340,7912,1.5,0,0,3,1340,0,1955,2005,18810 Densmore Ave N,Shoreline,WA 98133,USA
1,2014-05-02 00:00:00,2.384000e+06,5.0,2.50,3650,9050,2.0,0,4,5,3370,280,1921,0,709 W Blaine St,Seattle,WA 98119,USA
2,2014-05-02 00:00:00,3.420000e+05,3.0,2.00,1930,11947,1.0,0,0,4,1930,0,1966,0,26206-26214 143rd Ave SE,Kent,WA 98042,USA
3,2014-05-02 00:00:00,4.200000e+05,3.0,2.25,2000,8030,1.0,0,0,4,1000,1000,1963,0,857 170th Pl NE,Bellevue,WA 98008,USA
4,2014-05-02 00:00:00,5.500000e+05,4.0,2.50,1940,10500,1.0,0,0,4,1140,800,1976,1992,9105 170th Ave NE,Redmond,WA 98052,USA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4595,2014-07-09 00:00:00,3.081667e+05,3.0,1.75,1510,6360,1.0,0,0,4,1510,0,1954,1979,501 N 143rd St,Seattle,WA 98133,USA
4596,2014-07-09 00:00:00,5.343333e+05,3.0,2.50,1460,7573,2.0,0,0,3,1460,0,1983,2009,14855 SE 10th Pl,Bellevue,WA 98007,USA
4597,2014-07-09 00:00:00,4.169042e+05,3.0,2.50,3010,7014,2.0,0,0,3,3010,0,2009,0,759 Ilwaco Pl NE,Renton,WA 98059,USA
4598,2014-07-10 00:00:00,2.034000e+05,4.0,2.00,2090,6630,1.0,0,0,3,1070,1020,1974,0,5148 S Creston St,Seattle,WA 98178,USA


In [26]:
df2.dtypes

date              object
price            float64
bedrooms         float64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
street            object
city              object
statezip          object
country           object
dtype: object

In [24]:
df3 = pd.read_excel("Iris.xlsx")
df3

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [18]:
df3['sepal_length']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

In [19]:
df3[['sepal_length','sepal_width']]

Unnamed: 0,sepal_length,sepal_width
0,5.1,3.5
1,4.9,3.0
2,4.7,3.2
3,4.6,3.1
4,5.0,3.6
...,...,...
145,6.7,3.0
146,6.3,2.5
147,6.5,3.0
148,6.2,3.4


In [22]:
df3[df3['sepal_length'] < 5]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
6,4.6,3.4,1.4,0.3,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
11,4.8,3.4,1.6,0.2,setosa
12,4.8,3.0,1.4,0.1,setosa
13,4.3,3.0,1.1,0.1,setosa
22,4.6,3.6,1.0,0.2,setosa


In [27]:
import sys
sys.getsizeof(df2)

2041516
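
`sys.getsizeof` reports a single total. As a complementary sketch, `DataFrame.memory_usage` breaks the figure down per column; the small frame `df_mem` below is illustrative and is not the HousePrices data.

```python
import pandas as pd

df_mem = pd.DataFrame({'city': ['Seattle', 'Kent', 'Seattle'],
                       'price': [313000.0, 342000.0, 420000.0]})

# Bytes per column (plus the index); deep=True also counts the string contents
per_column = df_mem.memory_usage(deep=True)
print(per_column)
print("Total bytes:", per_column.sum())
```

The `price` column holds three 8-byte floats, so it accounts for 24 bytes; text columns usually dominate the total.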

### __1.2 Accessing the DataFrame__

Accessing a Pandas DataFrame means selecting and retrieving data: specific columns, rows, or individual cells.
- Square brackets select columns or filter rows with a condition, loc selects by label, iloc selects by integer position, and at and iat retrieve a single cell by label or by position.
- Because both label-based and position-based indexing are available, the same data can be reached either by name or by location.

In [23]:
import pandas as pd

# Creating a sample DataFrame
data = {'Column_name': [5, 15, 8],
        'Column1': [10, 20, 30],
        'Column2': [100, 200, 300],
        'Another_column': [25, 35, 45]}

df = pd.DataFrame(data)

# Accessing a single column
column_data = df['Column_name']
print("Single column:")
print(column_data)

# Accessing multiple columns
selected_columns = df[['Column1', 'Column2']]
print("\nMultiple columns:")
print(selected_columns)

# Accessing a specific row by index
row_data = df.iloc[0]
print("\nSpecific row:")
print(row_data)

# Accessing rows based on a condition
filtered_rows = df[df['Column_name'] > 10]
print("\nFiltered rows:")
print(filtered_rows)

# Accessing a single cell by label
value = df.at[0, 'Column_name']
print("\nSingle cell by label:")
print(value)

# Accessing a single cell by position
value = df.iat[0, 1]  # Row 0, Column 1
print("\nSingle cell by position:")
print(value)

# Accessing data using .loc
selected_data = df.loc[0, 'Column_name']
print("\nData using .loc:")
print(selected_data)

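# Conditional access: filter rows and pick a column in one .loc call
# (avoids chained indexing such as df[mask]['Another_column'])
selected_data = df.loc[df['Column_name'] > 10, 'Another_column']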
print("\nConditional access:")
print(selected_data)


Single column:
0     5
1    15
2     8
Name: Column_name, dtype: int64

Multiple columns:
   Column1  Column2
0       10      100
1       20      200
2       30      300

Specific row:
Column_name         5
Column1            10
Column2           100
Another_column     25
Name: 0, dtype: int64

Filtered rows:
   Column_name  Column1  Column2  Another_column
1           15       20      200              35

Single cell by label:
5

Single cell by position:
10

Data using .loc:
5

Conditional access:
1    35
Name: Another_column, dtype: int64
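
Conditions can also be combined. The following sketch reuses the column names from the sample frame above (under the illustrative name `df_demo`) and shows `&` together with `isin()`:

```python
import pandas as pd

df_demo = pd.DataFrame({'Column_name': [5, 15, 8],
                        'Column1': [10, 20, 30],
                        'Another_column': [25, 35, 45]})

# Each condition needs its own parentheses; use & (and), | (or), ~ (not)
mask = (df_demo['Column_name'] > 6) & (df_demo['Column1'] < 30)
print(df_demo[mask])

# .loc takes the row mask and the column label in a single step
print(df_demo.loc[mask, 'Another_column'])

# isin() tests membership in a list of values
print(df_demo[df_demo['Column1'].isin([10, 30])])
```

The parentheses matter because `&` binds more tightly than the comparison operators in Python.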


### __1.3 Understanding DataFrame Basics__
- The head() and tail() methods preview the first and last rows of a DataFrame (five by default), giving a quick look at its structure and content.
- This preview is useful for a first check of column names, value formats, and obvious data problems.
- The info() method summarizes the DataFrame: column names, data types, non-null counts, and memory usage, which helps in spotting missing data.
- The shape attribute returns the dimensions of the DataFrame as a (rows, columns) tuple.
- The syntax for some functions is provided below:

![image.png](attachment:abb1b0c7-34f9-46a3-819c-12d3822c2d18.png)

In [28]:
df = pd.read_csv("IPL IMB381IPL2013.csv")
df

Unnamed: 0,Sl.NO.,PLAYER NAME,AGE,COUNTRY,TEAM,PLAYING ROLE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,...,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AUCTION YEAR,BASE PRICE,SOLD PRICE
0,1,"Abdulla, YA",2,SA,KXIP,Allrounder,0,0,0,0.00,...,0.00,0,307,15,20.47,8.90,13.93,2009,50000,50000
1,2,Abdur Razzak,2,BAN,RCB,Bowler,214,18,657,71.41,...,0.00,0,29,0,0.00,14.50,0.00,2008,50000,50000
2,3,"Agarkar, AB",2,IND,KKR,Bowler,571,58,1269,80.62,...,121.01,5,1059,29,36.52,8.81,24.90,2008,200000,350000
3,4,"Ashwin, R",1,IND,CSK,Bowler,284,31,241,84.56,...,76.32,0,1125,49,22.96,6.23,22.14,2011,100000,850000
4,5,"Badrinath, S",2,IND,CSK,Batsman,63,0,79,45.93,...,120.71,28,0,0,0.00,0.00,0.00,2011,100000,800000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125,126,"Yadav, AS",2,IND,DC,Batsman,0,0,0,0.00,...,125.64,2,0,0,0.00,0.00,0.00,2010,50000,750000
126,127,Younis Khan,2,PAK,RR,Batsman,6398,7,6814,75.78,...,42.85,0,0,0,0.00,0.00,0.00,2008,225000,225000
127,128,Yuvraj Singh,2,IND,KXIP+,Batsman,1775,9,8051,87.58,...,131.88,67,569,23,24.74,7.02,21.13,2011,400000,1800000
128,129,Zaheer Khan,2,IND,MI+,Bowler,1114,288,790,73.55,...,91.67,1,1783,65,27.43,7.75,21.26,2008,200000,450000


In [29]:
df.head()

Unnamed: 0,Sl.NO.,PLAYER NAME,AGE,COUNTRY,TEAM,PLAYING ROLE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,...,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AUCTION YEAR,BASE PRICE,SOLD PRICE
0,1,"Abdulla, YA",2,SA,KXIP,Allrounder,0,0,0,0.0,...,0.0,0,307,15,20.47,8.9,13.93,2009,50000,50000
1,2,Abdur Razzak,2,BAN,RCB,Bowler,214,18,657,71.41,...,0.0,0,29,0,0.0,14.5,0.0,2008,50000,50000
2,3,"Agarkar, AB",2,IND,KKR,Bowler,571,58,1269,80.62,...,121.01,5,1059,29,36.52,8.81,24.9,2008,200000,350000
3,4,"Ashwin, R",1,IND,CSK,Bowler,284,31,241,84.56,...,76.32,0,1125,49,22.96,6.23,22.14,2011,100000,850000
4,5,"Badrinath, S",2,IND,CSK,Batsman,63,0,79,45.93,...,120.71,28,0,0,0.0,0.0,0.0,2011,100000,800000


In [30]:
df.tail() # last 5 entries of the data

Unnamed: 0,Sl.NO.,PLAYER NAME,AGE,COUNTRY,TEAM,PLAYING ROLE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,...,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AUCTION YEAR,BASE PRICE,SOLD PRICE
125,126,"Yadav, AS",2,IND,DC,Batsman,0,0,0,0.0,...,125.64,2,0,0,0.0,0.0,0.0,2010,50000,750000
126,127,Younis Khan,2,PAK,RR,Batsman,6398,7,6814,75.78,...,42.85,0,0,0,0.0,0.0,0.0,2008,225000,225000
127,128,Yuvraj Singh,2,IND,KXIP+,Batsman,1775,9,8051,87.58,...,131.88,67,569,23,24.74,7.02,21.13,2011,400000,1800000
128,129,Zaheer Khan,2,IND,MI+,Bowler,1114,288,790,73.55,...,91.67,1,1783,65,27.43,7.75,21.26,2008,200000,450000
129,130,"Zoysa, DNT",2,SL,DC,Bowler,288,64,343,95.81,...,122.22,0,99,2,49.5,9.0,33.0,2008,100000,110000


In [31]:
df.shape

(130, 26)

In [32]:
df.columns

Index(['Sl.NO.', 'PLAYER NAME', 'AGE', 'COUNTRY', 'TEAM', 'PLAYING ROLE',
       'T-RUNS', 'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B', 'ODI-WKTS', 'ODI-SR-BL',
       'CAPTAINCY EXP', 'RUNS-S', 'HS', 'AVE', 'SR-B', 'SIXERS', 'RUNS-C',
       'WKTS', 'AVE-BL', 'ECON', 'SR-BL', 'AUCTION YEAR', 'BASE PRICE',
       'SOLD PRICE'],
      dtype='object')

In [33]:
df.info() # information about the df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Sl.NO.         130 non-null    int64  
 1   PLAYER NAME    130 non-null    object 
 2   AGE            130 non-null    int64  
 3   COUNTRY        130 non-null    object 
 4   TEAM           130 non-null    object 
 5   PLAYING ROLE   130 non-null    object 
 6   T-RUNS         130 non-null    int64  
 7   T-WKTS         130 non-null    int64  
 8   ODI-RUNS-S     130 non-null    int64  
 9   ODI-SR-B       130 non-null    float64
 10  ODI-WKTS       130 non-null    int64  
 11  ODI-SR-BL      130 non-null    float64
 12  CAPTAINCY EXP  130 non-null    int64  
 13  RUNS-S         130 non-null    int64  
 14  HS             130 non-null    int64  
 15  AVE            130 non-null    float64
 16  SR-B           130 non-null    float64
 17  SIXERS         130 non-null    int64  
 18  RUNS-C    

In [34]:
# selecting the required columns
df[['PLAYER NAME','COUNTRY','TEAM','PLAYING ROLE']]

Unnamed: 0,PLAYER NAME,COUNTRY,TEAM,PLAYING ROLE
0,"Abdulla, YA",SA,KXIP,Allrounder
1,Abdur Razzak,BAN,RCB,Bowler
2,"Agarkar, AB",IND,KKR,Bowler
3,"Ashwin, R",IND,CSK,Bowler
4,"Badrinath, S",IND,CSK,Batsman
...,...,...,...,...
125,"Yadav, AS",IND,DC,Batsman
126,Younis Khan,PAK,RR,Batsman
127,Yuvraj Singh,IND,KXIP+,Batsman
128,Zaheer Khan,IND,MI+,Bowler


In [35]:
# get the summary, how many players from each country

In [36]:
df.head()

Unnamed: 0,Sl.NO.,PLAYER NAME,AGE,COUNTRY,TEAM,PLAYING ROLE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,...,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AUCTION YEAR,BASE PRICE,SOLD PRICE
0,1,"Abdulla, YA",2,SA,KXIP,Allrounder,0,0,0,0.0,...,0.0,0,307,15,20.47,8.9,13.93,2009,50000,50000
1,2,Abdur Razzak,2,BAN,RCB,Bowler,214,18,657,71.41,...,0.0,0,29,0,0.0,14.5,0.0,2008,50000,50000
2,3,"Agarkar, AB",2,IND,KKR,Bowler,571,58,1269,80.62,...,121.01,5,1059,29,36.52,8.81,24.9,2008,200000,350000
3,4,"Ashwin, R",1,IND,CSK,Bowler,284,31,241,84.56,...,76.32,0,1125,49,22.96,6.23,22.14,2011,100000,850000
4,5,"Badrinath, S",2,IND,CSK,Batsman,63,0,79,45.93,...,120.71,28,0,0,0.0,0.0,0.0,2011,100000,800000


In [38]:
# count the occurrences of each unique value in the COUNTRY column
df['COUNTRY'].value_counts()

IND    53
AUS    22
SA     16
SL     12
PAK     9
NZ      7
WI      6
ENG     3
BAN     1
ZIM     1
Name: COUNTRY, dtype: int64

In [39]:
# Statistical summary of the numeric columns
df.describe()

Unnamed: 0,Sl.NO.,AGE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,ODI-WKTS,ODI-SR-BL,CAPTAINCY EXP,RUNS-S,...,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AUCTION YEAR,BASE PRICE,SOLD PRICE
count,130.0,130.0,130.0,130.0,130.0,130.0,130.0,130.0,130.0,130.0,...,130.0,130.0,130.0,130.0,130.0,130.0,130.0,130.0,130.0,130.0
mean,65.5,2.092308,2166.715385,66.530769,2508.738462,71.164385,76.076923,34.033846,0.315385,514.246154,...,111.053462,17.692308,475.523077,17.169231,23.110231,6.204462,17.382615,2009.092308,192230.8,521223.1
std,37.671829,0.576627,3305.646757,142.676855,3582.205625,25.89844,111.20507,26.751749,0.466466,615.226335,...,35.928907,23.828146,558.314049,21.816763,20.802057,4.941531,15.273422,1.377821,153097.3,406807.4
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2008.0,20000.0,20000.0
25%,33.25,2.0,25.5,0.0,73.25,65.65,0.0,0.0,0.0,39.0,...,98.2375,1.0,0.0,0.0,0.0,0.0,0.0,2008.0,100000.0,225000.0
50%,65.5,2.0,542.5,7.0,835.0,78.225,18.5,36.6,0.0,172.0,...,118.51,6.0,297.0,8.5,24.785,7.38,19.935,2008.0,200000.0,437500.0
75%,97.75,2.0,3002.25,47.5,3523.5,86.79,106.0,45.325,1.0,925.25,...,129.1025,29.75,689.25,23.75,35.58,8.2475,26.2125,2011.0,225000.0,700000.0
max,130.0,3.0,15470.0,800.0,18426.0,116.66,534.0,150.0,1.0,2254.0,...,235.49,129.0,1975.0,83.0,126.3,38.11,100.2,2011.0,1350000.0,1800000.0


In [None]:
# Select the player name, sold price, and sort the dataframe based on sold price in descending order

In [41]:
df[['PLAYER NAME', 'SOLD PRICE']].sort_values(by='SOLD PRICE') # ascending order

Unnamed: 0,PLAYER NAME,SOLD PRICE
73,"Noffke, AA",20000
46,Kamran Khan,24000
0,"Abdulla, YA",50000
1,Abdur Razzak,50000
118,Van der Merwe,50000
...,...,...
113,"Tiwary, SS",1600000
111,"Tendulkar, SR",1800000
50,"Kohli, V",1800000
93,"Sehwag, V",1800000


In [42]:
df[['PLAYER NAME', 'SOLD PRICE']].sort_values(by='SOLD PRICE', ascending=False) # descending order

Unnamed: 0,PLAYER NAME,SOLD PRICE
93,"Sehwag, V",1800000
127,Yuvraj Singh,1800000
50,"Kohli, V",1800000
111,"Tendulkar, SR",1800000
113,"Tiwary, SS",1600000
...,...,...
34,"Henriques, MC",50000
5,"Bailey, GJ",50000
0,"Abdulla, YA",50000
46,Kamran Khan,24000


In [43]:
df[['PLAYER NAME', 'SOLD PRICE']].sort_values(by='SOLD PRICE', ascending=False).head()

       PLAYER NAME  SOLD PRICE
93       Sehwag, V     1800000
127   Yuvraj Singh     1800000
50        Kohli, V     1800000
111  Tendulkar, SR     1800000
113     Tiwary, SS     1600000
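
Sorting and then taking `head()` can also be written with `nlargest`. The sketch below uses a hypothetical miniature of the auction data (player names `P1` to `P4` are placeholders) to show that both approaches pick the same rows.

```python
import pandas as pd

# Hypothetical miniature of the auction data
auction = pd.DataFrame({'PLAYER NAME': ['P1', 'P2', 'P3', 'P4'],
                        'SOLD PRICE': [50000, 1800000, 24000, 1600000]})

# Approach 1: sort everything, then keep the first rows
top2_sorted = auction.sort_values(by='SOLD PRICE', ascending=False).head(2)

# Approach 2: ask directly for the n largest values of a column
top2_nlargest = auction.nlargest(2, 'SOLD PRICE')

print(top2_nlargest)
```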


In [44]:
df.head()

Unnamed: 0,Sl.NO.,PLAYER NAME,AGE,COUNTRY,TEAM,PLAYING ROLE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,...,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AUCTION YEAR,BASE PRICE,SOLD PRICE
0,1,"Abdulla, YA",2,SA,KXIP,Allrounder,0,0,0,0.0,...,0.0,0,307,15,20.47,8.9,13.93,2009,50000,50000
1,2,Abdur Razzak,2,BAN,RCB,Bowler,214,18,657,71.41,...,0.0,0,29,0,0.0,14.5,0.0,2008,50000,50000
2,3,"Agarkar, AB",2,IND,KKR,Bowler,571,58,1269,80.62,...,121.01,5,1059,29,36.52,8.81,24.9,2008,200000,350000
3,4,"Ashwin, R",1,IND,CSK,Bowler,284,31,241,84.56,...,76.32,0,1125,49,22.96,6.23,22.14,2011,100000,850000
4,5,"Badrinath, S",2,IND,CSK,Batsman,63,0,79,45.93,...,120.71,28,0,0,0.0,0.0,0.0,2011,100000,800000


In [45]:
# Which type of player (PLAYING ROLE) earns more?
# numeric_only=True skips text columns, which recent pandas versions refuse to average
df.groupby("PLAYING ROLE").mean(numeric_only=True)


PLAYING ROLE,Sl.NO.,AGE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,ODI-WKTS,ODI-SR-BL,CAPTAINCY EXP,RUNS-S,...,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AUCTION YEAR,BASE PRICE,SOLD PRICE
Allrounder,61.028571,2.057143,1702.142857,54.828571,2485.0,74.211714,99.771429,41.694286,0.285714,486.2,...,119.355143,20.085714,579.142857,19.028571,36.769143,8.725714,27.452857,2009.057143,178428.571429,519571.428571
Batsman,70.435897,2.205128,4100.358974,6.512821,4514.615385,74.454359,15.358974,33.435897,0.538462,906.076923,...,119.686154,28.384615,104.769231,3.128205,11.514359,4.035385,8.198205,2009.179487,232051.282051,647435.897436
Bowler,67.227273,2.022727,523.568182,147.090909,317.409091,64.195909,131.727273,36.525,0.068182,63.886364,...,92.05,2.159091,851.409091,32.818182,28.826136,7.813636,22.253636,2009.204545,166931.818182,419977.272727
W. Keeper,56.166667,2.083333,3262.25,0.333333,4093.75,77.135,0.25,4.5,0.583333,973.916667,...,128.463333,32.916667,0.0,0.0,0.0,0.0,0.0,2008.5,195833.333333,487083.333333


In [46]:
# Select the column before aggregating so only SOLD PRICE is averaged
df.groupby("PLAYING ROLE")['SOLD PRICE'].mean()


PLAYING ROLE
Allrounder    519571.428571
Batsman       647435.897436
Bowler        419977.272727
W. Keeper     487083.333333
Name: SOLD PRICE, dtype: float64
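
To rank the roles and see more than the mean, several aggregations can be computed at once and then sorted. This is a sketch on a hypothetical miniature of the auction data; the prices are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the auction data
auction = pd.DataFrame({
    'PLAYING ROLE': ['Batsman', 'Bowler', 'Batsman', 'Bowler', 'Allrounder'],
    'SOLD PRICE': [800000, 350000, 600000, 50000, 500000],
})

# Select the column first, then aggregate, so text columns are never averaged
summary = (auction.groupby('PLAYING ROLE')['SOLD PRICE']
                  .agg(['mean', 'median', 'count'])
                  .sort_values(by='mean', ascending=False))
print(summary)
```

Including `count` next to the mean helps judge how many players each average is based on.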

In [49]:
# Boolean Masking
df[df['SIXERS'] > 75]

Unnamed: 0,Sl.NO.,PLAYER NAME,AGE,COUNTRY,TEAM,PLAYING ROLE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,...,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AUCTION YEAR,BASE PRICE,SOLD PRICE
26,27,"Gayle, CH",2,WI,KKR+,Allrounder,6373,72,8087,83.95,...,161.79,129,606,13,46.62,8.05,34.85,2008,250000,800000
28,29,"Gilchrist, AC",3,AUS,DC+,W. Keeper,5570,0,9619,96.94,...,140.21,86,0,0,0.0,0.0,0.0,2008,300000,700000
82,83,"Pathan, YK",2,IND,RR+,Allrounder,0,0,810,113.6,...,149.25,81,1139,36,31.64,7.2,26.36,2008,100000,475000
88,89,"Raina, SK",1,IND,CSK,Batsman,710,13,3525,92.71,...,139.39,97,678,20,33.9,7.05,28.9,2008,125000,650000
93,94,"Sehwag, V",2,IND,DD,Batsman,8178,40,8090,104.68,...,167.32,79,226,6,37.67,10.56,21.67,2011,400000,1800000
97,98,"Sharma, RG",1,IND,DC+,Batsman,0,0,1961,78.85,...,129.17,82,408,14,29.14,8.0,21.86,2008,150000,750000


In [50]:
# Removing the columns
df.head()

Unnamed: 0,Sl.NO.,PLAYER NAME,AGE,COUNTRY,TEAM,PLAYING ROLE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,...,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AUCTION YEAR,BASE PRICE,SOLD PRICE
0,1,"Abdulla, YA",2,SA,KXIP,Allrounder,0,0,0,0.0,...,0.0,0,307,15,20.47,8.9,13.93,2009,50000,50000
1,2,Abdur Razzak,2,BAN,RCB,Bowler,214,18,657,71.41,...,0.0,0,29,0,0.0,14.5,0.0,2008,50000,50000
2,3,"Agarkar, AB",2,IND,KKR,Bowler,571,58,1269,80.62,...,121.01,5,1059,29,36.52,8.81,24.9,2008,200000,350000
3,4,"Ashwin, R",1,IND,CSK,Bowler,284,31,241,84.56,...,76.32,0,1125,49,22.96,6.23,22.14,2011,100000,850000
4,5,"Badrinath, S",2,IND,CSK,Batsman,63,0,79,45.93,...,120.71,28,0,0,0.0,0.0,0.0,2011,100000,800000


In [51]:
df2 = df.drop(columns=['Sl.NO.'])

In [52]:
df2.head()

Unnamed: 0,PLAYER NAME,AGE,COUNTRY,TEAM,PLAYING ROLE,T-RUNS,T-WKTS,ODI-RUNS-S,ODI-SR-B,ODI-WKTS,...,SR-B,SIXERS,RUNS-C,WKTS,AVE-BL,ECON,SR-BL,AUCTION YEAR,BASE PRICE,SOLD PRICE
0,"Abdulla, YA",2,SA,KXIP,Allrounder,0,0,0,0.0,0,...,0.0,0,307,15,20.47,8.9,13.93,2009,50000,50000
1,Abdur Razzak,2,BAN,RCB,Bowler,214,18,657,71.41,185,...,0.0,0,29,0,0.0,14.5,0.0,2008,50000,50000
2,"Agarkar, AB",2,IND,KKR,Bowler,571,58,1269,80.62,288,...,121.01,5,1059,29,36.52,8.81,24.9,2008,200000,350000
3,"Ashwin, R",1,IND,CSK,Bowler,284,31,241,84.56,51,...,76.32,0,1125,49,22.96,6.23,22.14,2011,100000,850000
4,"Badrinath, S",2,IND,CSK,Batsman,63,0,79,45.93,0,...,120.71,28,0,0,0.0,0.0,0.0,2011,100000,800000
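
`drop` returns a new DataFrame and leaves the original untouched, which is why the result above is assigned to a new name. A minimal sketch with a hypothetical three-column frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'Sl.NO.': [1, 2], 'AGE': [2, 1], 'TEAM': ['KXIP', 'CSK']})

# drop() returns a new DataFrame; df_demo itself is unchanged
trimmed = df_demo.drop(columns=['Sl.NO.', 'AGE'])

# errors='ignore' skips labels that are not present instead of raising KeyError
safe = df_demo.drop(columns=['Sl.NO.', 'NOT_A_COLUMN'], errors='ignore')

print(trimmed.columns.tolist())
print(safe.columns.tolist())
print(df_demo.columns.tolist())
```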


In [53]:
import pandas as pd

# Create a sample DataFrame
data = {'Column_name': [5, 15, 8],
        'Column1': [10, 20, 30],
        'Column2': [100, 200, 300],
        'Another_column': [25, 35, 45]}

df = pd.DataFrame(data)

# Display the first 2 rows
print("First 2 rows:")
print(df.head(2))

# Display the last row
print("\nLast row:")
print(df.tail(1))

# Provide a comprehensive summary of the DataFrame
print("\nDataFrame summary:")
df.info()

# Return a tuple representing the dimensions of the DataFrame (rows, columns)
print("\nDataFrame dimensions:")
print(df.shape)


First 2 rows:
   Column_name  Column1  Column2  Another_column
0            5       10      100              25
1           15       20      200              35

Last row:
   Column_name  Column1  Column2  Another_column
2            8       30      300              45

DataFrame summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Column_name     3 non-null      int64
 1   Column1         3 non-null      int64
 2   Column2         3 non-null      int64
 3   Another_column  3 non-null      int64
dtypes: int64(4)
memory usage: 228.0 bytes

DataFrame dimensions:
(3, 4)


## __2. Introduction to Statistical Operations in Pandas__
Pandas supports the computation of fundamental measures such as mean and median, along with the exploration of correlations and distribution characteristics. 

The following examples illustrate key statistical operations available in Pandas:

### __2.1 Descriptive Statistics__
Descriptive statistics offer a snapshot of a dataset's central tendency and dispersion.

The describe() function provides a quick summary of each numeric column: count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum.

In [54]:
import pandas as pd

# Create a sample DataFrame with numeric columns
data = {'Numeric_column1': [5, 15, 8],
        'Numeric_column2': [10, 20, 30],
        'Numeric_column3': [100, 200, 300]}

df = pd.DataFrame(data)

# Display descriptive statistics for numeric columns
print("Descriptive statistics for numeric columns:")
print(df.describe())


Descriptive statistics for numeric columns:
       Numeric_column1  Numeric_column2  Numeric_column3
count         3.000000              3.0              3.0
mean          9.333333             20.0            200.0
std           5.131601             10.0            100.0
min           5.000000             10.0            100.0
25%           6.500000             15.0            150.0
50%           8.000000             20.0            200.0
75%          11.500000             25.0            250.0
max          15.000000             30.0            300.0
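
By default `describe()` skips text columns. The sketch below, on a hypothetical frame with one numeric and one text column, shows two optional parameters: `include='all'` to summarize every column and `percentiles` to request other cut points.

```python
import pandas as pd

df_demo = pd.DataFrame({'Numeric_column1': [5, 15, 8],
                        'Category': ['A', 'B', 'A']})

# include='all' adds unique, top (most frequent value), and freq for text columns
full_summary = df_demo.describe(include='all')
print(full_summary)

# Custom percentiles replace the default 25% and 75% rows (the median is always kept)
custom = df_demo['Numeric_column1'].describe(percentiles=[0.1, 0.9])
print(custom)
```

Cells that do not apply to a column, such as the mean of a text column, are shown as NaN.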


### __2.2 Mean, Median, and Standard Deviation__

In [55]:
import pandas as pd

# Create a sample DataFrame with numeric columns
data = {'Numeric_column1': [5, 15, 8],
        'Numeric_column2': [10, 20, 30],
        'Numeric_column3': [100, 200, 300]}

df = pd.DataFrame(data)

# Calculate mean, median, and standard deviation
mean_value = df.mean()
median_value = df.median()
std_deviation = df.std()

print("Mean:\n", mean_value)
print("\nMedian:\n", median_value)
print("\nStandard deviation:\n", std_deviation)


Mean:
 Numeric_column1      9.333333
Numeric_column2     20.000000
Numeric_column3    200.000000
dtype: float64

Median:
 Numeric_column1      8.0
Numeric_column2     20.0
Numeric_column3    200.0
dtype: float64

Standard deviation:
 Numeric_column1      5.131601
Numeric_column2     10.000000
Numeric_column3    100.000000
dtype: float64
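
The standard deviation reported by Pandas is the sample standard deviation (`ddof=1`, dividing by n - 1), which differs from NumPy's default population formula. A short sketch using the values of `Numeric_column2`:

```python
import pandas as pd

values = pd.Series([10, 20, 30])

sample_std = values.std()            # ddof=1: divides by n - 1
population_std = values.std(ddof=0)  # divides by n

print(sample_std)       # 10.0
print(population_std)   # about 8.165
```

The squared deviations from the mean of 20 sum to 200, so the sample variance is 200 / 2 = 100 and the population variance is 200 / 3.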


### __2.3 Correlation Analysis__
The corr() function generates a correlation matrix of pairwise correlation coefficients between numeric columns (Pearson by default).

Values close to 1 or -1 indicate a strong positive or negative linear relationship, while values near 0 indicate a weak linear relationship.

In [56]:
import pandas as pd

# Create a sample DataFrame with numeric columns
data = {'Numeric_column1': [5, 15, 8],
        'Numeric_column2': [10, 20, 30],
        'Numeric_column3': [100, 200, 300]}

df = pd.DataFrame(data)

# Compute correlation matrix
correlation_matrix = df.corr()

print("Correlation matrix:\n", correlation_matrix)


Correlation matrix:
                  Numeric_column1  Numeric_column2  Numeric_column3
Numeric_column1         1.000000         0.292306         0.292306
Numeric_column2         0.292306         1.000000         1.000000
Numeric_column3         0.292306         1.000000         1.000000
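
`Numeric_column2` and `Numeric_column3` correlate perfectly because one is exactly ten times the other. Pearson is only one option: as a sketch, the `method` parameter also accepts `'spearman'`, which correlates the ranks instead of the raw values.

```python
import pandas as pd

df_demo = pd.DataFrame({'Numeric_column1': [5, 15, 8],
                        'Numeric_column2': [10, 20, 30]})

pearson = df_demo.corr()                    # default method='pearson'
spearman = df_demo.corr(method='spearman')  # correlation of the ranks

print(pearson.loc['Numeric_column1', 'Numeric_column2'])
print(spearman.loc['Numeric_column1', 'Numeric_column2'])
```

With ranks (1, 3, 2) against (1, 2, 3), the Spearman coefficient is 0.5, compared with a Pearson coefficient of about 0.29 on the raw values.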


#### __Value Counts__
The value_counts() function tallies the occurrences of unique values in a categorical column, aiding in understanding the distribution of categorical data.

In [None]:
import pandas as pd

# Create a sample DataFrame with a category column
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C']}
df = pd.DataFrame(data)

# Count occurrences of unique values in the category column
value_counts = df['Category'].value_counts()

print("Value counts:\n", value_counts)
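
Counts are often easier to compare as proportions. A minimal sketch, using the same category values in a standalone Series, with the `normalize` parameter:

```python
import pandas as pd

categories = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C'])

counts = categories.value_counts()                # absolute counts
shares = categories.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```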


# __Assisted Practice__

## __Problem Statement:__
Analyze a housing dataset using Pandas DataFrame and statistical operations to understand the basic characteristics of the data and the relationships between different variables.

## __Steps to Perform:__
- Load the housing dataset into a Pandas DataFrame
- Familiarize yourself with the DataFrame basics such as its structure, the data types of the columns, and summary statistics
- Calculate descriptive statistics like mean, median, and standard deviation for numerical columns such as __LotArea__, __YearBuilt__, __1stFlrSF__, __2ndFlrSF__, and __SalePrice__
- Count the number of occurrences of each category in categorical variables such as __city__ and __condition__
- Compute the average price per city using groupby, and sort the result in descending order

In [58]:
df2[['T-RUNS','T-WKTS','BASE PRICE','SOLD PRICE']].corr()

              T-RUNS    T-WKTS  BASE PRICE  SOLD PRICE
T-RUNS      1.000000  0.026285    0.437984    0.216752
T-WKTS      0.026285  1.000000    0.216648    0.035767
BASE PRICE  0.437984  0.216648    1.000000    0.523510
SOLD PRICE  0.216752  0.035767    0.523510    1.000000
