#### Pandas

> Pandas is an open-source data analysis and manipulation tool built on top of Python.

> It provides easy-to-use, high-performance data structures and data analysis tools, primarily for handling tabular data (like spreadsheets or databases) in a structured form.


> Pandas extends Python with the ability to work with spreadsheet-like data, enabling fast loading, aligning, manipulating, and merging of data.



#### Key Features of Pandas:

1. Data alignment: Automatically aligns data based on row/column labels, preventing misalignment issues.

2. Handling missing data: Pandas has built-in methods for handling missing or NaN (Not a Number) values.

3. Flexible reshaping: You can reshape, pivot, and transform data using methods like melt, pivot, and stack.

4. Powerful indexing: Pandas supports label-based (.loc) and position-based (.iloc) selection of rows and columns, which makes data manipulation easier.

5. Merge and join: It provides methods like merge() and join() for combining data from different sources.

6. Groupby functionality: You can group data by one or more columns and apply aggregation functions to it (e.g., sum, mean, count).

7. Time series support: Pandas has excellent support for time series data, allowing easy handling of dates, times, and frequency-based data.
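
A few of the features listed above can be sketched on toy data. The values and the names `a`, `b`, `sales`, and `cities` are made up for illustration and are not part of any dataset used later.

```python
import pandas as pd

# Data alignment: values are matched by index label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
total = a + b  # labels present in only one operand become NaN

# Handling missing data: replace the NaN values produced above.
filled = total.fillna(0)

# Groupby: aggregate one column per group.
sales = pd.DataFrame({"team": ["A", "A", "B"], "points": [5, 7, 4]})
per_team = sales.groupby("team")["points"].sum()

# Merge: combine two tables on a shared key column.
cities = pd.DataFrame({"team": ["A", "B"], "city": ["Boston", "Utah"]})
merged = sales.merge(cities, on="team", how="left")

print(total, filled, per_team, merged, sep="\n\n")
```

Only the labels `y` and `z` exist in both Series, so those are the only positions where `total` holds a number before `fillna` is applied.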


![alt text](pandas_dtype.png "Title")

In [65]:
import pandas as pd

In [66]:
# 1-D data => Series
# 2-D data => DataFrame
# 3-D data => Panel (deprecated in pandas 0.20 and removed in 0.25; use a DataFrame with a MultiIndex instead)

In [67]:
# Series 
series = pd.Series() # Empty Series 
print(series)
print(type(series))

Series([], dtype: object)
<class 'pandas.core.series.Series'>


In [68]:
series = pd.Series([10,20,30,40,50]) 
print(series)
print(type(series))
print(series[1])

series = pd.Series([10,20,30,40,50], index=[1,2,3,4,5]) 
print(series)
print(series[1])
print(series[2:])

0    10
1    20
2    30
3    40
4    50
dtype: int64
<class 'pandas.core.series.Series'>
20
1    10
2    20
3    30
4    40
5    50
dtype: int64
10
3    30
4    40
5    50
dtype: int64
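
With an integer index, plain `[]` is easy to misread: in the output above, `series[1]` looked up the label 1, while `series[2:]` sliced by position. A minimal sketch of the explicit accessors, which remove that ambiguity (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=[1, 2, 3, 4, 5])

by_label = s.loc[1]        # label 1 is the first element
by_position = s.iloc[1]    # position 1 is the second element

label_slice = s.loc[2:4]   # label slices include both endpoints
pos_slice = s.iloc[2:4]    # positional slices exclude the stop value

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Using `.loc` for labels and `.iloc` for positions keeps the intent visible even when the labels happen to be integers.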


In [69]:
dictS = {'a': 10, 'b':20, 'c':30, 'd':40}
series = pd.Series(dictS)
print(series)
print(series['a'])   # Accessing a value through its index label
print(series.iloc[0]) # Accessing by position; series[0] on a labelled index is deprecated
print(series[['a','c']]) # Accessing a subset of index labels
print(series[:4]) # Slicing the data by position

a    10
b    20
c    30
d    40
dtype: int64
10
10
a    10
c    30
dtype: int64
a    10
b    20
c    30
d    40
dtype: int64




In [70]:
seriesConstant = pd.Series(0, index=[1,2,3,4,5,6])
print(seriesConstant)

1    0
2    0
3    0
4    0
5    0
6    0
dtype: int64


In [71]:
# copy 
seriesCopy = seriesConstant.copy()
print(seriesCopy)

1    0
2    0
3    0
4    0
5    0
6    0
dtype: int64


In [72]:
# DataFrame => mutable, 2-D representation of data in rows and columns

df = pd.DataFrame()
print(df)



Empty DataFrame
Columns: []
Index: []


In [73]:
df = pd.DataFrame([10, 20, 30, 40, 50])  # 1-D list => rows = number of elements = 5, columns = 1
print(df)

    0
0  10
1  20
2  30
3  40
4  50


In [74]:
df = pd.DataFrame([[10], [30, 40], [50, 60]])  # Ragged 2-D list => rows = 3, columns = longest row = 2; missing cells become NaN
print(df)

    0     1
0  10   NaN
1  30  40.0
2  50  60.0


In [75]:
df = pd.DataFrame([[10], [30, 40], [50, 60]], index=['r1', 'r2', 'r3'], columns=['col1', 'col2'])  # Same data with explicit row and column labels
print(df)

    col1  col2
r1    10   NaN
r2    30  40.0
r3    50  60.0
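
The second column prints as `40.0` and `60.0` because the `NaN` that fills the short row is a float, so the whole column is stored as `float64`. The sketch below checks this and, as one optional alternative, converts the column to the nullable `Int64` dtype; `original_dtypes` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([[10], [30, 40], [50, 60]], columns=["col1", "col2"])

# col1 has no gaps and stays integer; col2 contains NaN and becomes float64.
original_dtypes = df.dtypes
print(original_dtypes)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
df["col2"] = df["col2"].astype("Int64")
print(df)
```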


In [76]:
# Accessing dataframe data

df['col1']   # Accessing the column 'col1'


r1    10
r2    30
r3    50
Name: col1, dtype: int64

In [77]:
df.col1  # Accessing the column 'col1' with the dot operator

r1    10
r2    30
r3    50
Name: col1, dtype: int64

In [78]:
# Accessing rows

# .loc and .iloc
# loc => Accessing rows by labels (index labels)
# iloc => Accessing rows by positions (index positions)

print(df.loc['r1']) # Access the row labelled 'r1': loc

print(df.iloc[0]) # Access the first row: iloc

col1    10.0
col2     NaN
Name: r1, dtype: float64
col1    10.0
col2     NaN
Name: r1, dtype: float64


In [79]:
print(df.loc['r1', 'col1']) # Value at row 'r1', column 'col1': loc

print(df.iloc[0, 0]) # Value at the first row, first column: iloc

10
10


In [80]:
# Accessing multiple rows and column values
df.loc['r1':'r2', ['col1']]  # Rows 'r1' through 'r2' (both included), first column only

Unnamed: 0,col1
r1,10
r2,30


In [81]:
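print(df.iloc[0:1, [0]]) # First row, first column; the list [0] keeps the result a DataFrame
print(df.iloc[0:1, 0])  # First row, first column; the scalar 0 returns a Series
print(df.iloc[0:2, 0:2]) # First two rows and first two columns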

    col1
r1    10
r1    10
Name: col1, dtype: int64
    col1  col2
r1    10   NaN
r2    30  40.0
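
The two accessors differ in how they treat slice endpoints: a `.loc` slice includes the stop label, while an `.iloc` slice excludes the stop position. A small sketch that rebuilds the same frame from a dict (an assumption made only to keep the example self-contained):

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [10, 30, 50], "col2": [None, 40.0, 60.0]},
    index=["r1", "r2", "r3"],
)

by_label = df.loc["r1":"r2", ["col1"]]   # 'r2' is included -> 2 rows
by_position = df.iloc[0:2, [0]]          # position 2 is excluded -> 2 rows

one_row = df.iloc[0:1, 0]   # slice + scalar column -> Series with one element
scalar = df.iloc[0, 0]      # scalar + scalar -> a single value

print(by_label.equals(by_position))
print(type(one_row).__name__, scalar)
```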


In [82]:
# Reading CSV into dataframe

import pandas as pd

teamdata = pd.read_csv("team.csv") 
teamdata

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
454,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
455,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
456,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
457,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


In [83]:
teamdata.shape

(459, 9)
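
`team.csv` itself is not reproduced here, so the sketch below parses a few rows copied from the output above, plus one all-empty record, from an in-memory string. The names `csv_text` and `sample` are illustrative; the point is only to show how empty fields turn into `NaN`.

```python
import io

import pandas as pd

# Rows copied from the displayed output; the last line holds only separators.
csv_text = """Name,Team,Number,Position,Age,Height,Weight,College,Salary
Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
,,,,,,,,
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.isna().sum())  # missing values per column
```

A line made up only of commas is not treated as blank, so it is read as a row of `NaN` values, which is consistent with the empty last record seen in `teamdata.tail(2)`.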

In [84]:
teamdata.head(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,


In [85]:
teamdata.tail(2)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
457,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0
458,,,,,,,,,


In [86]:
teamdata.index

RangeIndex(start=0, stop=459, step=1)

In [87]:
teamdata.columns

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

In [88]:
print(teamdata.values)
print(type(teamdata.values))

[['Avery Bradley' 'Boston Celtics' 0.0 ... 180.0 'Texas' 7730337.0]
 ['Jae Crowder' 'Boston Celtics' 99.0 ... 235.0 'Marquette' 6796117.0]
 ['John Holland' 'Boston Celtics' 30.0 ... 205.0 'Boston University' nan]
 ...
 ['Tibor Pleiss' 'Utah Jazz' 21.0 ... 256.0 nan 2900000.0]
 ['Jeff Withey' 'Utah Jazz' 24.0 ... 231.0 'Kansas' 947276.0]
 [nan nan nan ... nan nan nan]]
<class 'numpy.ndarray'>


In [89]:
teamdata.axes

[RangeIndex(start=0, stop=459, step=1),
 Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
        'College', 'Salary'],
       dtype='object')]

In [90]:
teamdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 459 entries, 0 to 458
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      458 non-null    object 
 1   Team      458 non-null    object 
 2   Number    458 non-null    float64
 3   Position  458 non-null    object 
 4   Age       458 non-null    float64
 5   Height    458 non-null    object 
 6   Weight    458 non-null    float64
 7   College   374 non-null    object 
 8   Salary    447 non-null    float64
dtypes: float64(4), object(5)
memory usage: 32.4+ KB


In [91]:
# Access data 

# name = teamdata['Name'] # Single column as a Series
name = teamdata[['Name']] # Single column as a DataFrame (note the list inside the brackets)
name.head(2)

Unnamed: 0,Name
0,Avery Bradley
1,Jae Crowder


In [92]:
name = teamdata[["Name", "Team"]] # Multiple columns: pass a list of column names
name.head(2)

Unnamed: 0,Name,Team
0,Avery Bradley,Boston Celtics
1,Jae Crowder,Boston Celtics


In [93]:
teamdata.head(2)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0


In [94]:
teamdata.insert(8, column="Qualification", value="Higher School") # in-place operation
teamdata.head(2)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Qualification,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,Higher School,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,Higher School,6796117.0


In [95]:
teamdata['Qualification'] = 'Senior Secondary'
teamdata.head(2)



Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Qualification,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,Senior Secondary,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,Senior Secondary,6796117.0
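
The two cells above use different mechanisms: `insert()` adds a new column at a chosen position, while plain assignment overwrites an existing column (or appends a new one at the end). A minimal sketch on a one-row frame; `duplicate_error` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Avery Bradley"], "Salary": [7730337.0]})

# insert() modifies df in place and places the column at position 1.
df.insert(1, column="Qualification", value="Higher School")

# Assignment to an existing column name replaces its values.
df["Qualification"] = "Senior Secondary"

# Inserting a column that already exists raises ValueError by default.
duplicate_error = None
try:
    df.insert(1, column="Qualification", value="Higher School")
except ValueError as exc:
    duplicate_error = exc

print(list(df.columns))
print(duplicate_error)
```

This is why re-running the `insert()` cell fails on a second execution, whereas the assignment cell can be repeated safely.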


In [96]:
teamdata['Salary'].value_counts()

Salary
947276.0      31
845059.0      18
525093.0      13
981348.0       6
16407500.0     5
              ..
2100000.0      1
1252440.0      1
2891760.0      1
3272091.0      1
900000.0       1
Name: count, Length: 310, dtype: int64
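
By default `value_counts()` ignores missing values and sorts by frequency. The sketch below shows two commonly used options on a toy Series; the salary values are taken from the output above but the counts are made up.

```python
import pandas as pd

salary = pd.Series([947276.0, 947276.0, 845059.0, None], name="Salary")

counts = salary.value_counts()                      # NaN excluded
counts_with_nan = salary.value_counts(dropna=False) # NaN counted as its own entry
shares = salary.value_counts(normalize=True)        # relative frequencies

print(counts)
print(counts_with_nan)
print(shares)
```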

In [97]:
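# how='all' drops only rows in which every column is NaN.
# Row 458 is kept here because its 'Qualification' value was filled in above.
teamdata.dropna(how='all', inplace=True)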

In [98]:
teamdata

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Qualification,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,Senior Secondary,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,Senior Secondary,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,Senior Secondary,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,Senior Secondary,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,Senior Secondary,5000000.0
...,...,...,...,...,...,...,...,...,...,...
454,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,Senior Secondary,2433333.0
455,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,Senior Secondary,900000.0
456,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,Senior Secondary,2900000.0
457,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,Senior Secondary,947276.0


In [99]:
droppedNA = teamdata.dropna(subset=["Name", "Team"], inplace=False)
droppedNA

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Qualification,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,Senior Secondary,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,Senior Secondary,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,Senior Secondary,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,Senior Secondary,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,Senior Secondary,5000000.0
...,...,...,...,...,...,...,...,...,...,...
453,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,Senior Secondary,2239800.0
454,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,Senior Secondary,2433333.0
455,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,Senior Secondary,900000.0
456,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,Senior Secondary,2900000.0


In [100]:
teamdata.tail(3)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Qualification,Salary
456,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,Senior Secondary,2900000.0
457,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,Senior Secondary,947276.0
458,,,,,,,,,Senior Secondary,
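
Row 458 is still present in `teamdata` because `how='all'` removes a row only when every column is missing, and `Qualification` is filled in for every row. A small sketch of the difference between `how='all'` and `subset`, using a two-row frame built for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jeff Withey", np.nan],
    "Team": ["Utah Jazz", np.nan],
    "Qualification": ["Senior Secondary", "Senior Secondary"],
})

# The second row is not entirely NaN, so how="all" keeps it.
kept_all = df.dropna(how="all")

# subset restricts the check to the listed columns, so the row is dropped.
by_subset = df.dropna(subset=["Name", "Team"])

print(len(kept_all), len(by_subset))
```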


In [101]:
# Pandas option to show all rows and columns

#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_colwidth', None)

# reset the options
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.max_colwidth')

In [102]:
teamdata

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Qualification,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,Senior Secondary,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,Senior Secondary,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,Senior Secondary,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,Senior Secondary,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,Senior Secondary,5000000.0
...,...,...,...,...,...,...,...,...,...,...
454,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,Senior Secondary,2433333.0
455,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,Senior Secondary,900000.0
456,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,Senior Secondary,2900000.0
457,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,Senior Secondary,947276.0


In [103]:
# Replacing NaN with default values.
# Chained calls such as droppedNA["Weight"].fillna(0, inplace=True) act on a temporary
# copy: they raise SettingWithCopyWarning/FutureWarning today and will stop working in pandas 3.0.
# See: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

# Fill every column in one call and assign the result back instead.
droppedNA = droppedNA.fillna({"Height": 0, "Weight": 0, "Salary": 0, "College": "Not Available"})

droppedNA

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Qualification,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,Senior Secondary,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,Senior Secondary,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,Senior Secondary,0.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,Senior Secondary,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,Not Available,Senior Secondary,5000000.0
...,...,...,...,...,...,...,...,...,...,...
453,Trey Lyles,Utah Jazz,41.0,PF,20.0,6-10,234.0,Kentucky,Senior Secondary,2239800.0
454,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,Senior Secondary,2433333.0
455,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,Not Available,Senior Secondary,900000.0
456,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,Not Available,Senior Secondary,2900000.0
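
Passing a dict to `fillna()` sets a separate default per column and returns a new DataFrame, which avoids chained in-place calls on a single column. A minimal sketch with three rows copied from the output above; `players`, `defaults`, and `filled` are illustrative names:

```python
import numpy as np
import pandas as pd

players = pd.DataFrame({
    "Name": ["Avery Bradley", "John Holland", "Jonas Jerebko"],
    "College": ["Texas", "Boston University", np.nan],
    "Salary": [7730337.0, np.nan, 5000000.0],
})

# One default per column; columns not listed are left untouched.
defaults = {"College": "Not Available", "Salary": 0}
filled = players.fillna(defaults)

print(filled)
print(players["Salary"].isna().sum())  # the original frame still has its NaN
```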


In [None]:
# Sorting the data
# sort_values() sorts by column values; sort_index() sorts by index labels
teamdata.sort_values("Name") # Default: ascending order
teamdata.sort_values("Name", ascending=False) # Descending order
teamdata.sort_values("Name", ascending=True) # Ascending order

teamdata.sort_values("Name", ascending=True, na_position="first") # NaNs at the beginning
teamdata.sort_values("Name", ascending=True, na_position="last") # NaNs at the end (default)

teamdata.sort_values(["Name", "Team"], ascending=[True, False], na_position="first", inplace=False) # Sort by Name ascending, then by Team descending
teamdata.sort_values(["Name", "Team"], ascending=[True, True], na_position="first") # Sort by Name, then by Team, both ascending

teamdata.sort_index(ascending=True)
teamdata.sort_index(ascending=False) # Returns a new, re-ordered DataFrame (inplace=False is the default)
teamdata.sort_index(ascending=False, inplace=True) # Re-orders the original DataFrame in place



Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Qualification,Salary
458,,,,,,,,,Senior Secondary,
457,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,Senior Secondary,947276.0
456,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,Senior Secondary,2900000.0
455,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,Senior Secondary,900000.0
454,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,Senior Secondary,2433333.0
...,...,...,...,...,...,...,...,...,...,...
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,Senior Secondary,5000000.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,Senior Secondary,1148640.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,Senior Secondary,
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,Senior Secondary,6796117.0
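
The effect of `na_position` and of sorting by index is easier to see on a tiny frame. The sketch below uses three rows (one with a missing name) chosen for illustration and compares the resulting row order:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Raul Neto", np.nan, "Jeff Withey"],
    "Team": ["Utah Jazz", np.nan, "Utah Jazz"],
})

nan_first = df.sort_values("Name", na_position="first")
nan_last = df.sort_values("Name", na_position="last")
desc_index = df.sort_index(ascending=False)

# Each call returns a new DataFrame; df keeps its original order.
print(list(nan_first.index))
print(list(nan_last.index))
print(list(desc_index.index))
print(list(df.index))
```

The original index labels travel with the rows, so `sort_index()` can restore the initial order after any `sort_values()` call.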
