# Pandas
`Pandas` is a Python library used for working with data sets.
## Why Use Pandas?
1. Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

2. Pandas can clean messy data sets and make them readable and relevant.

3. Relevant data is very important in data science.

## Import Pandas
Once `Pandas` is installed, import it in your applications with the `import` keyword:
>`import pandas`<br>

## Pandas as pd
Pandas is usually imported under the `pd` alias.

Create an alias with the `as` keyword while importing:
>`import pandas as pd`<br>

The Pandas package can now be referred to as `pd` instead of `pandas`.



In [1]:
#importing pandas
import pandas as pd

## Checking Pandas Version
The version string is stored in the `__version__` attribute.

In [2]:
#Checking Pandas Version.
print("Pandas version : ", pd.__version__)

Pandas version :  1.5.3


## Series
A Pandas `Series` is like a column in a table.

It is a one-dimensional labeled array that can hold data of any type (integers, floats, strings, Python objects).

In short: a 1D, labeled, homogeneous array whose size is immutable.
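As a rough illustration of "homogeneous", the sketch below checks the `dtype` of two small Series. The variable names are made up for this example, and the exact integer dtype may depend on your platform and pandas version.

```python
import pandas as pd

# All values share one dtype; a list of Python ints is stored as an integer dtype.
numbers = pd.Series([1, 2, 3])
print(numbers.dtype)

# Mixing types falls back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)

# Assigning to an existing label changes the value, not the length.
numbers.loc[0] = 10
print(numbers.tolist())
```

Checking `dtype` is a quick way to spot a column that was accidentally read as `object` instead of a numeric type.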

## Create an Empty Series
The simplest Series you can create is an empty one.

In [3]:
series = pd.Series(dtype='object')  # pass dtype explicitly to avoid a warning about the default dtype of an empty Series
print(series)

Series([], dtype: object)


In [4]:
#create a list.
series1 = [1,2,3,4]
result1 = pd.Series(series1)
print("Pandas Series")
print(result1)

Pandas Series
0    1
1    2
2    3
3    4
dtype: int64


## Labels
If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

In [5]:
print("Accessing the value at index 2 : ", result1[2])

Accessing the value at index 2 :  3


## Create Labels
With the index argument, you can name your own labels.

In [6]:
#create a list.
series1 = [1,2,3,4]
result1 = pd.Series(series1, index=['A', 'B', 'C', 'D'])
print("Pandas Series")
print(result1)

Pandas Series
A    1
B    2
C    3
D    4
dtype: int64
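Once a Series has custom labels, it helps to be explicit about whether you are selecting by label or by position. The sketch below uses `.loc` (label) and `.iloc` (position); the variable name `lettered` is just for illustration.

```python
import pandas as pd

lettered = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# .loc selects by label.
print(lettered.loc["C"])   # 3

# .iloc selects by integer position, whatever the labels are.
print(lettered.iloc[2])    # 3
```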


## Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.

In [7]:
#Create a dictionary.
blackpink = {'Lisa':'Dancer', 'Jennie':'Rapper', 'Jisoo':'Idol', 'Rose':'Vocalist'}
result2 = pd.Series(blackpink)
print(result2)

Lisa        Dancer
Jennie      Rapper
Jisoo         Idol
Rose      Vocalist
dtype: object


>Note: The keys of the dictionary become the labels.

To select only some of the items in the dictionary, use the `index` argument and specify only the items you want to include in the Series.

In [8]:
#Create a dictionary.
blackpink = {'Lisa':'Dancer', 'Jennie':'Rapper', 'Jisoo':'Idol', 'Rose':'Vocalist'}
result2 = pd.Series(blackpink, index=['Lisa', 'Rose' ])
print(result2)

Lisa      Dancer
Rose    Vocalist
dtype: object
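One thing worth knowing about the `index` argument: a label that is not a key in the dictionary still appears in the Series, with a missing value. A small sketch, using a shortened version of the dictionary above:

```python
import pandas as pd

roles = {"Lisa": "Dancer", "Jennie": "Rapper"}

# "Rose" is not a key in roles, so its value is filled with NaN.
subset = pd.Series(roles, index=["Lisa", "Rose"])
print(subset)
print(subset.isna().tolist())   # [False, True]
```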


## Create a Series from Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.


In [9]:
s = pd.Series(5,index=[1,2,3,4,5])
print(s)

1    5
2    5
3    5
4    5
5    5
dtype: int64


## DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

If a Series is like a column, a DataFrame is the whole table.

In general, a DataFrame is a 2D labeled, size-mutable tabular structure whose columns may have different types.
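To make "size-mutable" and "heterogeneously typed" a bit more concrete, here is a minimal sketch. The `IsAdult` column is a hypothetical addition, computed from `Age` only for the sake of the example.

```python
import pandas as pd

members = pd.DataFrame({"Name": ["Lisa", "Jennie"], "Age": [27, 28]})

# Each column of a DataFrame is a Series.
print(type(members["Age"]))

# Size-mutable: a new column can be added after creation.
members["IsAdult"] = members["Age"] >= 18
print(members.shape)    # (2, 3)

# Each column keeps its own dtype (text, integer, boolean).
print(members.dtypes)
```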

## Create an Empty DataFrame
The simplest DataFrame you can create is an empty one.

In [10]:
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


## Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.

In [11]:
list1 = [1,2,3,4]
df1 = pd.DataFrame(list1)
print(df1)

   0
0  1
1  2
2  3
3  4


In [12]:
list2=[
    ['Lisa',27],
    ['Jennie',25],
    ['Rose',28],
    ['Jisoo',29],
   
]
df2 = pd.DataFrame(list2, columns=['Name', 'Age'])
print(df2)

     Name  Age
0    Lisa   27
1  Jennie   25
2    Rose   28
3   Jisoo   29


In [13]:
# Create a DataFrame from a dictionary of lists; each key becomes a column.
data = {
    "Name" : ['Lisa', 'Jennie', 'Rose', 'Jisoo'],
    "Age" : [27,28,29,30]
}

result3 = pd.DataFrame(data)
print(result3)

     Name  Age
0    Lisa   27
1  Jennie   28
2    Rose   29
3   Jisoo   30


## Create a DataFrame from List of Dicts
A list of dictionaries can be passed as input data to create a DataFrame. By default, the dictionary keys are used as column names, and a key that is missing from a row shows up as NaN.

In [14]:
dict1 = [
    {'a' : 10, 'b' : 15},
    {'a' : 14, 'b' : 51, 'c' : 23}
]
df3 = pd.DataFrame(dict1)
print(df3)

    a   b     c
0  10  15   NaN
1  14  51  23.0


## Locate Row

Pandas uses the `loc` attribute to return one or more specified rows.

In [15]:
#Return row 0.
print(result3.loc[0])

Name    Lisa
Age       27
Name: 0, dtype: object


In [16]:
#Return row 0 and 2.
print(result3.loc[[0, 2]])

   Name  Age
0  Lisa   27
2  Rose   29
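`loc` also accepts a column label after the row label, which narrows the result to a single cell or a single column. A short sketch, rebuilding the same table under the hypothetical name `members`:

```python
import pandas as pd

members = pd.DataFrame({
    "Name": ["Lisa", "Jennie", "Rose", "Jisoo"],
    "Age": [27, 28, 29, 30],
})

# Row label 0, column "Name": a single value.
print(members.loc[0, "Name"])               # Lisa

# Row labels 0 and 2, column "Age": a Series.
print(members.loc[[0, 2], "Age"].tolist())  # [27, 29]
```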


## Named Indexes
With the index argument, you can name your own indexes.

In [17]:
# Create a DataFrame from a dictionary of lists, with custom row labels.
data1 = {
    "Name" : ['Lisa', 'Jennie', 'Rose', 'Jisoo'],
    "Age" : [27,28,29,30]
}

result4 = pd.DataFrame(data1, index=['member1','member2','member3','member4'])
print(result4)

           Name  Age
member1    Lisa   27
member2  Jennie   28
member3    Rose   29
member4   Jisoo   30


## Locate Named Indexes
Use the named index in the loc attribute to return the specified row(s).

In [18]:
# Return the row labeled member3.
print(result4.loc["member3"])

Name    Rose
Age       29
Name: member3, dtype: object


## Loading in Data

The function we're going to use to read in the file is called `pd.read_csv()`.

In [72]:
df_csv = pd.read_csv('RegularSeasonCompactResults.csv')
df_csv

        Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
0         1985      20   1228      81   1328      64    N      0
1         1985      25   1106      77   1354      70    H      0
2         1985      25   1112      63   1223      56    H      0
3         1985      25   1165      70   1432      54    H      0
4         1985      25   1192      86   1447      74    H      0
...        ...     ...    ...     ...    ...     ...  ...    ...
145284    2016     132   1114      70   1419      50    N      0
145285    2016     132   1163      72   1272      58    N      0
145286    2016     132   1246      82   1401      77    N      1
145287    2016     132   1277      66   1345      62    N      0
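If you don't have the CSV file at hand, you can still try `pd.read_csv()` on in-memory text. The sketch below feeds it the first three rows shown above through `io.StringIO`, which stands in for a file on disk; `csv_text` and `games` are names made up for this example.

```python
import io
import pandas as pd

# Stand-in for a file on disk: a header line plus three data rows.
csv_text = """Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
1985,20,1228,81,1328,64,N,0
1985,25,1106,77,1354,70,H,0
1985,25,1112,63,1223,56,H,0
"""

# read_csv accepts a path or any file-like object.
games = pd.read_csv(io.StringIO(csv_text))
print(games.shape)              # (3, 8)
print(games["Wloc"].tolist())   # ['N', 'H', 'H']
```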


# Basics
## Head and Tail
Use the `head()` function to see the first few rows of the dataframe, or the `tail()` function to see the last few rows.

In [73]:
df_csv.head()

   Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
0    1985      20   1228      81   1328      64    N      0
1    1985      25   1106      77   1354      70    H      0
2    1985      25   1112      63   1223      56    H      0
3    1985      25   1165      70   1432      54    H      0
4    1985      25   1192      86   1447      74    H      0


In [74]:
df_csv.tail()

        Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
145284    2016     132   1114      70   1419      50    N      0
145285    2016     132   1163      72   1272      58    N      0
145286    2016     132   1246      82   1401      77    N      1
145287    2016     132   1277      66   1345      62    N      0
145288    2016     132   1386      87   1433      74    N      0
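Both functions return five rows by default and take an optional row count. A minimal sketch on a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"n": range(10)})

# Ask for a specific number of rows instead of the default five.
print(frame.head(3)["n"].tolist())   # [0, 1, 2]
print(frame.tail(2)["n"].tolist())   # [8, 9]
```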


## shape
We can see the dimensions of the dataframe using the `shape` attribute.

In [75]:
df_csv.shape

(145289, 8)

## columns
We can extract all the column names as a list by using the `columns` attribute, and the row labels by using the `index` attribute.

In [76]:
df_csv.columns.tolist()

['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc', 'Numot']
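The `index` attribute works the same way as `columns`. A quick sketch with a small frame (the name `frame` is just for this example):

```python
import pandas as pd

frame = pd.DataFrame(
    {"Name": ["Lisa", "Jennie"], "Age": [27, 28]},
    index=["member1", "member2"],
)

print(frame.columns.tolist())   # ['Name', 'Age']
print(frame.index.tolist())     # ['member1', 'member2']
```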

## describe 
To get a better idea of the type of data that we are dealing with, we can call the `describe()` function to see statistics such as the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column of the dataset.

In [77]:
df_csv.describe()

            Season     Daynum        Wteam     Wscore        Lteam     Lscore     Numot
count     145289.0   145289.0     145289.0   145289.0     145289.0   145289.0  145289.0
mean   2001.574834  75.223816  1286.720646  76.600321  1282.864064  64.497009  0.044387
std       9.233342  33.287418   104.570275  12.173033   104.829234  11.380625  0.247819
min         1985.0        0.0       1101.0       34.0       1101.0       20.0       0.0
25%         1994.0       47.0       1198.0       68.0       1191.0       57.0       0.0
50%         2002.0       78.0       1284.0       76.0       1280.0       64.0       0.0
75%         2010.0      103.0       1379.0       84.0       1375.0       72.0       0.0
max         2016.0      132.0       1464.0      186.0       1464.0      150.0       6.0


## max
The `max()` function returns the maximum value of every column.

In [78]:
df_csv.max()

Season    2016
Daynum     132
Wteam     1464
Wscore     186
Lteam     1464
Lscore     150
Wloc         N
Numot        6
dtype: object
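Notice that the text column `Wloc` is included too: for strings, the "maximum" is the last value in alphabetical order. If you only want numeric columns, `max()` takes a `numeric_only` argument. A small sketch on a made-up two-row frame:

```python
import pandas as pd

frame = pd.DataFrame({"Wscore": [81, 77], "Wloc": ["N", "H"]})

# All columns: strings are compared alphabetically, so "N" > "H".
all_max = frame.max()
print(all_max)

# Numeric columns only.
numeric_max = frame.max(numeric_only=True)
print(numeric_max.to_dict())   # {'Wscore': 81}
```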

If you'd like to get the max value for one particular column, select the column by name with the bracket indexing operator first.

In [79]:
df_csv['Season'].max()

2016

## mean
The `mean()` function returns the mean of a particular column.


In [80]:
df_csv['Lscore'].mean()

64.49700940883343

## value_counts
`value_counts()` function shows how many times each item appears in the column.

In [81]:
df_csv['Season'].value_counts()

2016    5369
2014    5362
2015    5354
2013    5320
2010    5263
2012    5253
2009    5249
2011    5246
2008    5163
2007    5043
2006    4757
2005    4675
2003    4616
2004    4571
2002    4555
2000    4519
2001    4467
1999    4222
1998    4167
1997    4155
1992    4127
1991    4123
1996    4122
1995    4077
1994    4060
1990    4045
1989    4037
1993    3982
1988    3955
1987    3915
1986    3783
1985    3737
Name: Season, dtype: int64
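The result is sorted from most to least frequent. If you would rather see proportions than raw counts, `value_counts()` accepts `normalize=True`. A sketch with a short made-up Series of game locations:

```python
import pandas as pd

locations = pd.Series(["N", "H", "H", "H", "A", "A"])

# Raw counts, most frequent first.
counts = locations.value_counts()
print(counts.to_dict())   # {'H': 3, 'A': 2, 'N': 1}

# Proportions instead of counts.
shares = locations.value_counts(normalize=True)
print(shares["H"])        # 0.5
```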

## argmax 
The `argmax()` function in pandas returns the integer position of the maximum value in a Series. If the maximum value occurs more than once, the position of the first occurrence is returned.

In [82]:
df_csv['Wscore'].argmax()

24970
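With the default index, position and label are the same number, so the distinction is easy to miss. The sketch below uses a made-up Series with letter labels to show `argmax()` next to `idxmax()`, which returns the label instead of the position.

```python
import pandas as pd

scores = pd.Series([70, 95, 95, 60], index=["w", "x", "y", "z"])

# Integer position of the first maximum.
print(scores.argmax())   # 1

# Index label of the first maximum.
print(scores.idxmax())   # x
```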

## iloc
The `iloc` indexer in pandas is used to access and retrieve data from a DataFrame by integer position. It lets you select specific rows and columns by passing their integer positions.

In [84]:
df_csv.iloc[[df_csv['Wscore'].argmax()]]

       Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
24970    1991      68   1258     186   1109     140    H      0


In [85]:
df_csv.iloc[[df_csv['Wscore'].argmax()]]['Lscore']

24970    140
Name: Lscore, dtype: int64

>When you see data displayed in the above format, you're dealing with a Pandas Series object, not a dataframe object.

In [86]:
type(df_csv.iloc[[df_csv['Wscore'].argmax()]]['Lscore'])

pandas.core.series.Series

In [87]:
type(df_csv.iloc[[df_csv['Wscore'].argmax()]])

pandas.core.frame.DataFrame

## loc
The `loc` attribute of a pandas DataFrame is used to select rows and columns by their labels. It can be used to select a single row, a single column, multiple rows, or multiple columns. The first cell below uses `iloc` so that the two can be compared.

In [88]:
df_csv.iloc[:3]

   Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
0    1985      20   1228      81   1328      64    N      0
1    1985      25   1106      77   1354      70    H      0
2    1985      25   1112      63   1223      56    H      0


In [89]:
df_csv.loc[:3]

   Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
0    1985      20   1228      81   1328      64    N      0
1    1985      25   1106      77   1354      70    H      0
2    1985      25   1112      63   1223      56    H      0
3    1985      25   1165      70   1432      54    H      0


>`Note`: When slicing, `iloc` excludes the end position, while `loc` includes the end label.
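The difference matters even more once the row labels no longer match the row positions, for example after sorting. A sketch on a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Default index: iloc[:3] gives 3 rows, loc[:3] gives 4 rows.
print(len(frame.iloc[:3]), len(frame.loc[:3]))   # 3 4

# After sorting, the labels run 4, 3, 2, 1, 0.
shuffled = frame.sort_values("value", ascending=False)

# First three positions.
print(shuffled.iloc[:3].index.tolist())   # [4, 3, 2]

# Everything from the start up to and including label 3.
print(shuffled.loc[:3].index.tolist())    # [4, 3]
```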

## at
The `at` accessor in pandas is used to get or set a single value in a dataframe based on the row label and column name. It is similar to `loc`, but it is more efficient when you only need to access a single value.


In [90]:
df_csv.loc[df_csv['Wscore'].argmax(),'Lscore']

140

In [92]:
df_csv.at[df_csv['Wscore'].argmax(),'Lscore']

140
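Since `at` can also set a value, here is a minimal sketch of both directions on a small frame. The new age is an arbitrary value chosen for the example; `iat` is the position-based counterpart.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Name": ["Lisa", "Jennie"], "Age": [27, 28]},
    index=["member1", "member2"],
)

# Get a single value by row label and column name.
print(frame.at["member2", "Age"])   # 28

# Set a single value the same way.
frame.at["member2", "Age"] = 29
print(frame.at["member2", "Age"])   # 29

# iat does the same with integer positions (row 0, column 0).
print(frame.iat[0, 0])              # Lisa
```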

## sorting
The `sort_values()` method in Pandas is used to sort the values in a DataFrame. It takes a column name or a list of column names as an argument and returns a new DataFrame sorted by the values in those columns, in ascending order by default.

In [93]:
df_csv.sort_values('Lscore').head()

        Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
100027    2008      66   1203      49   1387      20    H      0
49310     1997      66   1157      61   1204      21    H      0
89021     2006      44   1284      41   1343      21    A      0
85042     2005      66   1131      73   1216      22    H      0
103660    2009      26   1326      59   1359      22    H      0
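For descending order, or for sorting by more than one column, `sort_values()` takes an `ascending` argument. The frame below is a tiny made-up sample, not the real data set.

```python
import pandas as pd

games = pd.DataFrame({
    "Season": [1985, 1985, 1986, 1986],
    "Lscore": [64, 70, 56, 70],
})

# Highest losing score first.
by_score_desc = games.sort_values("Lscore", ascending=False)
print(by_score_desc["Lscore"].tolist())   # [70, 70, 64, 56]

# Season ascending, then Lscore descending within each season.
by_two = games.sort_values(["Season", "Lscore"], ascending=[True, False])
print(by_two.index.tolist())              # [1, 0, 3, 2]
```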


## groupby
The pandas `groupby()` function groups rows that share the same value in one or more columns, so that an operation can be performed on each group. Calling `count()` on the grouped data returns the number of non-null values in each group.

In [69]:
import pandas as pd

sample = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# Group the DataFrame by the 'A' column
grouped_df = sample.groupby('A')

# Count the non-null 'B' values within each group of 'A'
counts = grouped_df['B'].count()

# Print the results
print(counts)

A
1    1
2    1
3    1
4    1
5    1
Name: B, dtype: int64
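In the sample above every value of `A` is unique, so each group holds exactly one row. Grouping becomes more interesting when keys repeat. The sketch below uses a small made-up frame with repeated seasons and computes both a count and a mean per group.

```python
import pandas as pd

games = pd.DataFrame({
    "Season": [1985, 1985, 1986, 1986, 1986],
    "Wscore": [81, 77, 63, 70, 86],
})

# Group rows by season, then look at the winning score in each group.
per_season = games.groupby("Season")["Wscore"]

counts = per_season.count()
means = per_season.mean()

print(counts.to_dict())   # {1985: 2, 1986: 3}
print(means.to_dict())    # {1985: 79.0, 1986: 73.0}
```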


## Data Cleaning
The expression `df.isnull().sum()` returns the number of missing values in each column of a DataFrame.

The pandas `isnull()` method is used to check for missing values in a DataFrame. It returns a boolean DataFrame where True indicates a missing value and False indicates a non-missing value.

The `sum()` method returns the sum of the values along the requested axis (each column by default). Because True counts as 1 and False as 0, summing the result of `isnull()` gives the number of missing values per column.

In [94]:
print(df_csv)

        Season  Daynum  Wteam  Wscore  Lteam  Lscore Wloc  Numot
0         1985      20   1228      81   1328      64    N      0
1         1985      25   1106      77   1354      70    H      0
2         1985      25   1112      63   1223      56    H      0
3         1985      25   1165      70   1432      54    H      0
4         1985      25   1192      86   1447      74    H      0
...        ...     ...    ...     ...    ...     ...  ...    ...
145284    2016     132   1114      70   1419      50    N      0
145285    2016     132   1163      72   1272      58    N      0
145286    2016     132   1246      82   1401      77    N      1
145287    2016     132   1277      66   1345      62    N      0
145288    2016     132   1386      87   1433      74    N      0

[145289 rows x 8 columns]


In [95]:
df_csv.isnull()

        Season  Daynum  Wteam  Wscore  Lteam  Lscore   Wloc  Numot
0        False   False  False   False  False   False  False  False
1        False   False  False   False  False   False  False  False
2        False   False  False   False  False   False  False  False
3        False   False  False   False  False   False  False  False
4        False   False  False   False  False   False  False  False
...        ...     ...    ...     ...    ...     ...    ...    ...
145284   False   False  False   False  False   False  False  False
145285   False   False  False   False  False   False  False  False
145286   False   False  False   False  False   False  False  False
145287   False   False  False   False  False   False  False  False


In [97]:
df_csv.sum()

Season                                            290806806
Daynum                                             10929193
Wteam                                             186946356
Wscore                                             11129184
Lteam                                             186386037
Lscore                                              9370706
Wloc      NHHHHHNNHHHNHNHHHAHHAHHHAHHHAHNHAAHHHHHHHAHNHN...
Numot                                                  6449
dtype: object

In [98]:
df_csv.isnull().sum()

Season    0
Daynum    0
Wteam     0
Wscore    0
Lteam     0
Lscore    0
Wloc      0
Numot     0
dtype: int64
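This data set happens to have no missing values, so every count is zero. To see a non-zero result, the sketch below reuses the list of dictionaries from earlier, where the first row has no `c` key. `dropna()` and `fillna()` are shown as two common follow-ups; which one fits depends on your data.

```python
import pandas as pd

frame = pd.DataFrame([
    {"a": 10, "b": 15},
    {"a": 14, "b": 51, "c": 23},
])

# Count the missing values per column.
missing = frame.isnull().sum()
print(missing.to_dict())                # {'a': 0, 'b': 0, 'c': 1}

# Option 1: drop every row that contains a missing value.
print(frame.dropna().shape)             # (1, 3)

# Option 2: replace missing values with a default.
print(frame.fillna(0)["c"].tolist())    # [0.0, 23.0]
```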