In [1]:
import numpy as np
import pandas as pd

## 1. Working with Pandas Series

### a) Creating series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. Labels need not be unique, but they must be of a hashable type. The object supports both integer-based and label-based indexing and provides a host of methods for performing operations involving the index.

In [3]:
lst = [10,20,30,40,50]
pd.Series(lst)

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [4]:
arr = np.array([1,2,3,4,5])
pd.Series(arr)

0    1
1    2
2    3
3    4
4    5
dtype: int32

In [5]:
pd.Series(data = ['Eshant', 'Pranjal', 'Jayesh', 'Ashish'], index = [1,2,3,4])

1     Eshant
2    Pranjal
3     Jayesh
4     Ashish
dtype: object

In [6]:
# Avoid naming the variable `dict`, which would shadow the Python built-in
daily_sales = {'day1' : 4000, 'day2' : 6000, 'day3' : 9000}
pd.Series(daily_sales)

day1    4000
day2    6000
day3    9000
dtype: int64
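
The point that labels need not be unique, and that a Series supports both label-based and integer-based indexing, can be checked with a small sketch. The values and labels here are made up for illustration.

```python
import pandas as pd

# Labels may repeat; selecting a repeated label returns every matching row.
s = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

by_label = s.loc['a']      # label-based: both rows labeled 'a'
by_position = s.iloc[1]    # position-based: the second element

print(by_label)
print(by_position)
```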

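The Series.repeat() method repeats the elements of a Series. It returns a new Series in which each element of the current Series is repeated consecutively a given number of times. The index labels are repeated along with the values, which is why every row below is labeled 0.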

In [7]:
pd.Series(7).repeat(5)

0    7
0    7
0    7
0    7
0    7
dtype: int64

We can call reset_index(drop = True) to replace the repeated labels with a fresh 0..n-1 index.

In [8]:
pd.Series(7).repeat(5).reset_index(drop = True)

0    7
1    7
2    7
3    7
4    7
dtype: int64

Passing a list of counts repeats each element a different number of times. In the following example:
 - 10 is repeated 3 times and
 - 20 is repeated 2 times.

In [9]:
s = pd.Series([10,20]).repeat([3,2]).reset_index(drop = True)
s

0    10
1    10
2    10
3    20
4    20
dtype: int64

In [10]:
s[2:4]

2    10
3    20
dtype: int64

### b) Aggregate function on pandas series
Series.aggregate() (alias agg()) applies one or more aggregation operations to the given Series. When a list of operations is passed, the result is a Series indexed by operation name.

In [11]:
sr = pd.Series([1,2,3,4,5,6,7])
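# Pass names as strings; passing the built-ins min/max/sum triggers a FutureWarning in recent pandas
sr.agg(['min','max','sum'])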

min     1
max     7
sum    28
dtype: int64
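
As a small sketch of how the shape of the result depends on the argument: a single operation name gives back a scalar, while a list of names gives back a Series. The variable names are illustrative only.

```python
import pandas as pd

sr = pd.Series([1, 2, 3, 4, 5, 6, 7])

total = sr.agg('sum')                     # one operation -> scalar
summary = sr.agg(['min', 'max', 'mean'])  # list of operations -> Series

print(total)
print(summary)
```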

### c) Series absolute function
The Series.abs() method returns the absolute value of each element. The same method is available on a DataFrame.

In [12]:
sr = pd.Series([1,-2,-3,4,5,-6,7])
sr.abs()

0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64

### d) Appending Series
Series.append() was deprecated in pandas 1.4 and removed in pandas 2.0. Use pd.concat() to concatenate two or more Series objects.

In [13]:
sr1 = pd.Series([1,-2,-3,4,5,-6,7])
sr2 = pd.Series([1,2,3,4,5,6,7])
pd.concat([sr1, sr2]).reset_index(drop = True)


0     1
1    -2
2    -3
3     4
4     5
5    -6
6     7
7     1
8     2
9     3
10    4
11    5
12    6
13    7
dtype: int64
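
A minimal sketch of two common pd.concat() options when joining Series: ignore_index=True renumbers the result in one step, and keys= tags each piece so it can be recovered later. The key names 'first' and 'second' are arbitrary choices for this example.

```python
import pandas as pd

sr1 = pd.Series([1, -2, -3])
sr2 = pd.Series([4, 5])

# Renumber the combined index 0..n-1 without a separate reset_index call.
combined = pd.concat([sr1, sr2], ignore_index=True)

# Keep track of which input each value came from.
keyed = pd.concat([sr1, sr2], keys=['first', 'second'])

print(combined)
print(keyed.loc['second'])
```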

### e) Astype function
The astype() method changes the data type of a Series. When a DataFrame is built from a CSV file, the column data types are inferred automatically, and the inferred type is often not the one the data should have.

In [14]:
type(sr2)

pandas.core.series.Series

In [15]:
sr2.astype('int')

0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int32
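
The CSV situation described above typically shows up as numbers stored as text. The sketch below, using made-up values, converts such a column with astype() and shows pd.to_numeric() as one possible fallback when some entries cannot be parsed.

```python
import pandas as pd

raw = pd.Series(['1', '2', '3'])   # numbers stored as strings
as_int = raw.astype('int64')       # every value parses cleanly

messy = pd.Series(['1', 'two', '3'])
# astype('int64') would raise here; errors='coerce' turns bad entries into NaN.
cleaned = pd.to_numeric(messy, errors='coerce')

print(as_int.dtype)
print(cleaned)
```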

### f) Between function
The between() method checks, element by element, whether each value lies between the first and second argument. Both bounds are inclusive by default, which is why 7 counts as a match below.

In [16]:
sr = pd.Series([1,2,3,4,5,6,7])
sr.between(2,7)

0    False
1     True
2     True
3     True
4     True
5     True
6     True
dtype: bool
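
How the bounds are treated can be changed with the inclusive parameter. This sketch assumes pandas 1.3 or later, where inclusive accepts the strings 'both', 'neither', 'left' and 'right'.

```python
import pandas as pd

sr = pd.Series([1, 2, 3, 4, 5, 6, 7])

both = sr.between(2, 7)                          # 2 <= x <= 7 (default)
neither = sr.between(2, 7, inclusive='neither')  # 2 <  x <  7
left = sr.between(2, 7, inclusive='left')        # 2 <= x <  7

print(both.sum(), neither.sum(), left.sum())
```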

### g) String functions
The .str accessor exposes vectorized string methods that extract or modify the text in a Series.

In [17]:
s = pd.Series(['Artificial Intelligence','Machine Learning','Deep Learning','Natural Language processing','Data Science'])

In [18]:
print(s.str.upper())
print(s.str.lower())

0        ARTIFICIAL INTELLIGENCE
1               MACHINE LEARNING
2                  DEEP LEARNING
3    NATURAL LANGUAGE PROCESSING
4                   DATA SCIENCE
dtype: object
0        artificial intelligence
1               machine learning
2                  deep learning
3    natural language processing
4                   data science
dtype: object


In [19]:
for i in s:
    print(len(i))

23
16
13
27
12


In [20]:
s.str.count('a')

0    1
1    2
2    1
3    4
4    2
dtype: int64

In [21]:
print(s.str.startswith('A'))
print(s.str.endswith('g'))

0     True
1    False
2    False
3    False
4    False
dtype: bool
0    False
1     True
2     True
3     True
4    False
dtype: bool


In [23]:
print(s.to_list())

['Artificial Intelligence', 'Machine Learning', 'Deep Learning', 'Natural Language processing', 'Data Science']
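
A few more .str methods, sketched on a shortened version of the Series above. Note that s.str.len() gives the same lengths as the explicit for loop, but as a Series.

```python
import pandas as pd

s = pd.Series(['Artificial Intelligence', 'Machine Learning', 'Deep Learning'])

lengths = s.str.len()                      # vectorized len()
has_learning = s.str.contains('Learning')  # substring test per element
first_word = s.str.split(' ').str[0]       # split, then take the first piece

print(lengths.tolist())
print(has_learning.tolist())
print(first_word.tolist())
```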


## 2. Detailed coding implementation on Pandas DataFrame
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.

### a) Creating dataframes
In practice, a DataFrame is usually created by loading a dataset from existing storage such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from lists, a dictionary, a list of dictionaries, and so on. Here are some of the ways to create one:

#### Creating a dataframe using a list
A DataFrame can be created from a list or from a list of lists.

In [24]:
lst = ['Hi', 'everyone', 'my', 'name', 'is', 'Mayank']
pd.DataFrame(lst)

Unnamed: 0,0
0,Hi
1,everyone
2,my
3,name
4,is
5,Mayank


In [25]:
lst = [['tom',100],['jerry',200],['spike',300]]
pd.DataFrame(lst)

Unnamed: 0,0,1
0,tom,100
1,jerry,200
2,spike,300


#### Creating DataFrame from dict of ndarray/lists:
To create a DataFrame from a dict of ndarrays/lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.

In [26]:
data = {'name':['Tom','Nick','Krish','Jack'], 'age':[20,21,19,18]}
pd.DataFrame(data)

Unnamed: 0,name,age
0,Tom,20
1,Nick,21
2,Krish,19
3,Jack,18
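
The two rules above can be sketched as follows: an explicit index of matching length is accepted, while arrays of unequal length are rejected with a ValueError. The index labels 'a' to 'd' are arbitrary.

```python
import pandas as pd

data = {'name': ['Tom', 'Nick', 'Krish', 'Jack'], 'age': [20, 21, 19, 18]}

# An explicit index must have the same length as the arrays (4 here).
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

# Arrays of different lengths cannot be combined into one frame.
raised = False
try:
    pd.DataFrame({'name': ['Tom', 'Nick'], 'age': [20]})
except ValueError:
    raised = True

print(df)
print(raised)
```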


#### Basic operations on rows and columns
We can perform basic operations on rows and columns such as selecting, deleting, adding, and renaming.

Column selection: to select columns in a Pandas DataFrame, access them by their column names.

In [29]:
data = {'Name'          : ['Jai','Prince','Gaurav','Anuj'],
        'Age'           : [27,24,22,32],
        'Address'       : ['Delhi','Kanpur','Allahbad','Kannauj'],
        'Qualification' : ['Msc','MA','MCA','Phd']}
df = pd.DataFrame(data)
df[['Name','Qualification']]

Unnamed: 0,Name,Qualification
0,Jai,Msc
1,Prince,MA
2,Gaurav,MCA
3,Anuj,Phd
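
One detail worth sketching: the type of the result depends on whether a single name or a list of names is passed. The two-row frame is a cut-down version of the data above.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Jai', 'Prince'], 'Age': [27, 24]})

col = df['Name']     # single name -> Series
sub = df[['Name']]   # list of names -> DataFrame, even with one column

print(type(col).__name__)
print(type(sub).__name__)
```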


### b) Slicing in DataFrames Using Iloc and Loc
Pandas provides several ways to select data, and loc and iloc are two of them. Both are indexers used with square brackets (df.loc[...] and df.iloc[...]) to slice data out of a DataFrame. They make it convenient to select data in Python, including filtering rows according to conditions.

In [3]:
data = {'one'   : pd.Series([1,2,3,4]),
        'two'   : pd.Series([10,20,30,40]),
        'three' : pd.Series([100,200,300,400]),
        'four'  : pd.Series([1000,2000,3000,4000])}
df = pd.DataFrame(data)
df

Unnamed: 0,one,two,three,four
0,1,10,100,1000
1,2,20,200,2000
2,3,30,300,3000
3,4,40,400,4000


#### Basic loc Operations
loc is a label-based selector, which means that we pass the names (labels) of the rows or columns we want to select. Unlike iloc, a loc slice includes the last element of the range passed to it. loc also accepts boolean data, such as a boolean Series, for filtering rows.

In [10]:
df.loc[1:3,'one':'three']

Unnamed: 0,one,two,three
1,2,20,200
2,3,30,300
3,4,40,400


#### Basic iloc Operations
iloc is an integer-position-based selector, which means that we pass integer positions to select specific rows or columns. Unlike loc, an iloc slice does not include the last element of the range passed to it. iloc does not accept a boolean Series as a mask, although it does accept a plain boolean array.

In [18]:
df.iloc[[0,2],[1,3]]

Unnamed: 0,two,four
0,10,1000
2,30,3000


In [19]:
df.iloc[1:3]

Unnamed: 0,one,two,three,four
1,2,20,200,2000
2,3,30,300,3000
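
The inclusive versus exclusive end point is easiest to see when the same slice is passed to both indexers. This sketch uses a two-column version of the frame above with the default integer index.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]})

label_slice = df.loc[1:3]      # labels 1, 2 and 3 (end included)
position_slice = df.iloc[1:3]  # positions 1 and 2 (end excluded)

print(len(label_slice), len(position_slice))
```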


### c) Slicing using condition
Boolean conditions are normally used with loc: the condition selects the rows, and the second argument selects the columns.

In [24]:
df.loc[df['two'] > 20,['three','four']]

Unnamed: 0,three,four
2,300,3000
3,400,4000
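
Conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. The sketch below also shows one way to reuse the same mask with iloc by converting it to a plain NumPy array first.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4],
                   'two': [10, 20, 30, 40],
                   'three': [100, 200, 300, 400]})

# Parentheses are required because & binds tighter than > and <.
mask = (df['two'] > 10) & (df['three'] < 400)

with_loc = df.loc[mask, ['one', 'three']]
with_iloc = df.iloc[mask.to_numpy(), [0, 2]]  # iloc needs an array, not a Series

print(with_loc)
```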


In [25]:
df = pd.DataFrame(data)
df

Unnamed: 0,one,two,three,four
0,1,10,100,1000
1,2,20,200,2000
2,3,30,300,3000
3,4,40,400,4000


In [26]:
sr = pd.Series([111,222,333,444])
df['five'] = sr
df

Unnamed: 0,one,two,three,four,five
0,1,10,100,1000,111
1,2,20,200,2000,222
2,3,30,300,3000,333
3,4,40,400,4000,444


### d) Column deletion in DataFrames

In [27]:
del df['three']
df

Unnamed: 0,one,two,four,five
0,1,10,1000,111
1,2,20,2000,222
2,3,30,3000,333
3,4,40,4000,444


In [30]:

df

Unnamed: 0,one,two,five
0,1,10,111
1,2,20,222
2,3,30,333
3,4,40,444
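
Besides del, two other standard ways to remove a column are pop() and drop(). A minimal sketch on a small made-up frame: pop() removes the column in place and returns it, while drop() returns a new frame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [10, 20], 'three': [100, 200]})

popped = df.pop('three')            # df loses 'three'; the column is returned
trimmed = df.drop(columns=['two'])  # new frame; df still has 'two'

print(list(df.columns))
print(list(trimmed.columns))
```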


### e) Addition of rows
DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so rows are now added with pd.concat(). Create a new DataFrame holding the desired row values and concatenate it with the original DataFrame. Here's an example that adds the rows of one DataFrame to another.

In [33]:
df1 = pd.DataFrame([[1,2],[3,4]], columns = ['a','b'])
df2 = pd.DataFrame([[1,2],[3,4]], columns = ['a','b'])
pd.concat([df1, df2]).reset_index(drop = True)



Unnamed: 0,a,b
0,1,2
1,3,4
2,1,2
3,3,4


### f) Pandas drop function
Pandas provides the drop() method for deleting rows and columns from a DataFrame. Rows are removed by index label (axis = 0) and columns by column name (axis = 1).

In [2]:
data = {'one'   : pd.Series([1,2,3,4]),
        'two'   : pd.Series([10,20,30,40]),
        'three' : pd.Series([100,200,300,400]),
        'four'  : pd.Series([1000,2000,3000,4000])}
df = pd.DataFrame(data)
df

Unnamed: 0,one,two,three,four
0,1,10,100,1000
1,2,20,200,2000
2,3,30,300,3000
3,4,40,400,4000


In [5]:
df.drop([0,3], axis = 0, inplace = True)

In [7]:
df.drop(['one', 'four'], axis = 1, inplace = True)

In [8]:
df

Unnamed: 0,two,three
1,20,200
2,30,300


### g) Transposing a DataFrame

In [9]:
data = {'one'   : pd.Series([1,2,3,4]),
        'two'   : pd.Series([10,20,30,40]),
        'three' : pd.Series([100,200,300,400]),
        'four'  : pd.Series([1000,2000,3000,4000])}
df = pd.DataFrame(data)
df

Unnamed: 0,one,two,three,four
0,1,10,100,1000
1,2,20,200,2000
2,3,30,300,3000
3,4,40,400,4000


In [10]:
df.T

Unnamed: 0,0,1,2,3
one,1,2,3,4
two,10,20,30,40
three,100,200,300,400
four,1000,2000,3000,4000


### Statistical or Mathematical functions

In [11]:
data = {'one'   : pd.Series([1,2,3,4]),
        'two'   : pd.Series([10,20,30,40]),
        'three' : pd.Series([100,200,300,400]),
        'four'  : pd.Series([1000,2000,3000,4000])}
df = pd.DataFrame(data)
df

Unnamed: 0,one,two,three,four
0,1,10,100,1000
1,2,20,200,2000
2,3,30,300,3000
3,4,40,400,4000


In [12]:
df.sum()

one         10
two        100
three     1000
four     10000
dtype: int64

In [13]:
df.mean()

one         2.5
two        25.0
three     250.0
four     2500.0
dtype: float64

In [14]:
df.median()

one         2.5
two        25.0
three     250.0
four     2500.0
dtype: float64

In [19]:
df.var()

one      1.666667e+00
two      1.666667e+02
three    1.666667e+04
four     1.666667e+06
dtype: float64

In [18]:
df.std()

one         1.290994
two        12.909944
three     129.099445
four     1290.994449
dtype: float64

In [20]:
df.min()

one         1
two        10
three     100
four     1000
dtype: int64

In [21]:
df.max()

one         4
two        40
three     400
four     4000
dtype: int64
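
Two details of these functions are easy to miss, so here is a short sketch. They work column by column unless axis=1 is passed, and std()/var() compute the sample statistic (ddof=1) by default, whereas NumPy's np.std defaults to the population statistic (ddof=0).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4],
                   'two': [10, 20, 30, 40],
                   'three': [100, 200, 300, 400],
                   'four': [1000, 2000, 3000, 4000]})

row_sums = df.sum(axis=1)                 # one total per row

sample_std = df['one'].std()              # ddof=1, as in the output above
population_std = df['one'].std(ddof=0)    # same convention as np.std

print(row_sums.tolist())
print(sample_std, population_std)
```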

### Describe function

In [22]:
df

Unnamed: 0,one,two,three,four
0,1,10,100,1000
1,2,20,200,2000
2,3,30,300,3000
3,4,40,400,4000


In [23]:
df.describe()

Unnamed: 0,one,two,three,four
count,4.0,4.0,4.0,4.0
mean,2.5,25.0,250.0,2500.0
std,1.290994,12.909944,129.099445,1290.994449
min,1.0,10.0,100.0,1000.0
25%,1.75,17.5,175.0,1750.0
50%,2.5,25.0,250.0,2500.0
75%,3.25,32.5,325.0,3250.0
max,4.0,40.0,400.0,4000.0


### Pipe functions

In [2]:
data = {'one'   : pd.Series([1,2,3,4]),
        'two'   : pd.Series([10,20,30,40]),
        'three' : pd.Series([100,200,300,400]),
        'four'  : pd.Series([1000,2000,3000,4000])}
df = pd.DataFrame(data)
df

Unnamed: 0,one,two,three,four
0,1,10,100,1000
1,2,20,200,2000
2,3,30,300,3000
3,4,40,400,4000


In [3]:
def add_(i,j):
    return i+j
def sub_(i,j):
    return i-j

In [5]:
df.pipe(add_,10), df.pipe(sub_,10)

(   one  two  three  four
 0   11   20    110  1010
 1   12   30    210  2010
 2   13   40    310  3010
 3   14   50    410  4010,
    one  two  three  four
 0   -9    0     90   990
 1   -8   10    190  1990
 2   -7   20    290  2990
 3   -6   30    390  3990)

In [6]:
def mean_(col):
    return col.mean()
def square(i):
    return i*i
df.pipe(mean_).pipe(square)

one            6.25
two          625.00
three      62500.00
four     6250000.00
dtype: float64
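
df.pipe(f, ...) is equivalent to calling f(df, ...), which lets a sequence of steps read from left to right instead of inside out. In this sketch, scale is a hypothetical helper introduced only to show that keyword arguments are forwarded as well.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [10, 20, 30, 40]})

def add_(frame, amount):
    return frame + amount

def scale(frame, factor=1):
    # Hypothetical helper, not part of pandas.
    return frame * factor

piped = df.pipe(add_, 10).pipe(scale, factor=2)  # reads left to right
nested = scale(add_(df, 10), factor=2)           # same computation, nested

print(piped)
```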

### Working with csv files and basic data analysis using Pandas

#### a) Reading csv
Read a CSV file from the local file system with pd.read_csv().

In [4]:
df = pd.read_csv('Football.csv')

In [5]:
df

Unnamed: 0,Country,League,Club,Player Names,Matches_Played,Substitution,Mins,Goals,xG,xG Per Avg Match,Shots,OnTarget,Shots Per Avg Match,On Target Per Avg Match,Year
0,Spain,La Liga,(BET),Juanmi Callejon,19,16,1849,11,6.62,0.34,48,20,2.47,1.03,2016
1,Spain,La Liga,(BAR),Antoine Griezmann,36,0,3129,16,11.86,0.36,88,41,2.67,1.24,2016
2,Spain,La Liga,(ATL),Luis Suarez,34,1,2940,28,23.21,0.75,120,57,3.88,1.84,2016
3,Spain,La Liga,(CAR),Ruben Castro,32,3,2842,13,14.06,0.47,117,42,3.91,1.40,2016
4,Spain,La Liga,(VAL),Kevin Gameiro,21,10,1745,13,10.65,0.58,50,23,2.72,1.25,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
655,Netherlands,Eredivisie,(UTR),Gyrano Kerk,24,0,2155,10,7.49,0.33,50,18,2.20,0.79,2020
656,Netherlands,Eredivisie,(AJA),Quincy Promes,18,2,1573,12,9.77,0.59,56,30,3.38,1.81,2020
657,Netherlands,Eredivisie,(PSV),Denzel Dumfries,25,0,2363,7,5.72,0.23,45,14,1.81,0.56,2020
658,Netherlands,Eredivisie,,Cyriel Dessers,26,0,2461,15,14.51,0.56,84,43,3.24,1.66,2020


In [6]:
df.head(5)

Unnamed: 0,Country,League,Club,Player Names,Matches_Played,Substitution,Mins,Goals,xG,xG Per Avg Match,Shots,OnTarget,Shots Per Avg Match,On Target Per Avg Match,Year
0,Spain,La Liga,(BET),Juanmi Callejon,19,16,1849,11,6.62,0.34,48,20,2.47,1.03,2016
1,Spain,La Liga,(BAR),Antoine Griezmann,36,0,3129,16,11.86,0.36,88,41,2.67,1.24,2016
2,Spain,La Liga,(ATL),Luis Suarez,34,1,2940,28,23.21,0.75,120,57,3.88,1.84,2016
3,Spain,La Liga,(CAR),Ruben Castro,32,3,2842,13,14.06,0.47,117,42,3.91,1.4,2016
4,Spain,La Liga,(VAL),Kevin Gameiro,21,10,1745,13,10.65,0.58,50,23,2.72,1.25,2016
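
read_csv() accepts any file-like object, so its options can be tried without the file on disk. The sketch below feeds it two rows copied from the output above (with only a subset of the columns) through io.StringIO, and uses usecols to load just the columns of interest.

```python
import io
import pandas as pd

# Two rows taken from the table above, restricted to six columns.
csv_text = """Country,League,Club,Player Names,Goals,Year
Spain,La Liga,(BET),Juanmi Callejon,11,2016
Spain,La Liga,(BAR),Antoine Griezmann,16,2016
"""

sample = pd.read_csv(io.StringIO(csv_text),
                     usecols=['Player Names', 'Goals', 'Year'])

print(sample)
```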


#### b) Info function

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Country                  660 non-null    object 
 1   League                   660 non-null    object 
 2   Club                     660 non-null    object 
 3   Player Names             660 non-null    object 
 4   Matches_Played           660 non-null    int64  
 5   Substitution             660 non-null    int64  
 6   Mins                     660 non-null    int64  
 7   Goals                    660 non-null    int64  
 8   xG                       660 non-null    float64
 9   xG Per Avg Match         660 non-null    float64
 10  Shots                    660 non-null    int64  
 11  OnTarget                 660 non-null    int64  
 12  Shots Per Avg Match      660 non-null    float64
 13  On Target Per Avg Match  660 non-null    float64
 14  Year                     6

#### c) isnull() function to check if there are NaN values present

In [8]:
df.isnull()

Unnamed: 0,Country,League,Club,Player Names,Matches_Played,Substitution,Mins,Goals,xG,xG Per Avg Match,Shots,OnTarget,Shots Per Avg Match,On Target Per Avg Match,Year
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
655,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
656,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
657,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
658,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [9]:
df.isnull().sum()

Country                    0
League                     0
Club                       0
Player Names               0
Matches_Played             0
Substitution               0
Mins                       0
Goals                      0
xG                         0
xG Per Avg Match           0
Shots                      0
OnTarget                   0
Shots Per Avg Match        0
On Target Per Avg Match    0
Year                       0
dtype: int64
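
The football data has no missing values, so here is a sketch on a small hypothetical frame that does. It shows the per-column count and one possible way to fill the gaps with fillna(); the replacement values 'unknown' and 0 are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
sample = pd.DataFrame({'Club': ['(PSV)', None], 'Goals': [7, np.nan]})

missing_per_column = sample.isnull().sum()
filled = sample.fillna({'Club': 'unknown', 'Goals': 0})

print(missing_per_column)
print(filled)
```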

#### d) quantile function to get the specific percentile value

In [11]:
df.describe()

Unnamed: 0,Matches_Played,Substitution,Mins,Goals,xG,xG Per Avg Match,Shots,OnTarget,Shots Per Avg Match,On Target Per Avg Match,Year
count,660.0,660.0,660.0,660.0,660.0,660.0,660.0,660.0,660.0,660.0,660.0
mean,22.371212,3.224242,2071.416667,11.810606,10.089606,0.476167,64.177273,28.365152,2.948015,1.315652,2018.363636
std,9.754658,3.839498,900.595049,6.075315,5.724844,0.192831,34.941622,16.363149,0.914906,0.474239,1.3677
min,2.0,0.0,264.0,2.0,0.71,0.07,5.0,2.0,0.8,0.24,2016.0
25%,14.0,0.0,1363.5,8.0,6.1,0.34,37.75,17.0,2.335,0.98,2017.0
50%,24.0,2.0,2245.5,11.0,9.285,0.435,62.0,26.0,2.845,1.25,2019.0
75%,31.0,5.0,2822.0,14.0,13.2525,0.57,86.0,37.0,3.3825,1.54,2019.0
max,38.0,26.0,4177.0,42.0,32.54,1.35,208.0,102.0,7.2,3.63,2020.0


In [12]:
df.describe(percentiles = [.80])

Unnamed: 0,Matches_Played,Substitution,Mins,Goals,xG,xG Per Avg Match,Shots,OnTarget,Shots Per Avg Match,On Target Per Avg Match,Year
count,660.0,660.0,660.0,660.0,660.0,660.0,660.0,660.0,660.0,660.0,660.0
mean,22.371212,3.224242,2071.416667,11.810606,10.089606,0.476167,64.177273,28.365152,2.948015,1.315652,2018.363636
std,9.754658,3.839498,900.595049,6.075315,5.724844,0.192831,34.941622,16.363149,0.914906,0.474239,1.3677
min,2.0,0.0,264.0,2.0,0.71,0.07,5.0,2.0,0.8,0.24,2016.0
50%,24.0,2.0,2245.5,11.0,9.285,0.435,62.0,26.0,2.845,1.25,2019.0
80%,32.0,6.0,2915.8,15.0,14.076,0.61,90.0,39.0,3.6,1.63,2020.0
max,38.0,26.0,4177.0,42.0,32.54,1.35,208.0,102.0,7.2,3.63,2020.0


In [13]:
df['Mins'].quantile(.8)

2915.8

In [14]:
df['Mins'].quantile(.99)

3520.0199999999995
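
By default quantile() interpolates linearly between neighbouring values, which is why results such as 2915.8 need not appear in the data. A small worked sketch on the values 1 to 4: the 25% quantile sits at position 0.25 * (4 - 1) = 0.75, that is, three quarters of the way from 1 to 2.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4])

q25 = values.quantile(0.25)

# The same number computed by hand with linear interpolation.
position = 0.25 * (len(values) - 1)   # 0.75
manual = 1 + position * (2 - 1)       # 1.75

quartiles = values.quantile([0.25, 0.5, 0.75])

print(q25, manual)
print(quartiles.tolist())
```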

#### e) Copy function

In [15]:
de = df.copy()
de


Unnamed: 0,Country,League,Club,Player Names,Matches_Played,Substitution,Mins,Goals,xG,xG Per Avg Match,Shots,OnTarget,Shots Per Avg Match,On Target Per Avg Match,Year
0,Spain,La Liga,(BET),Juanmi Callejon,19,16,1849,11,6.62,0.34,48,20,2.47,1.03,2016
1,Spain,La Liga,(BAR),Antoine Griezmann,36,0,3129,16,11.86,0.36,88,41,2.67,1.24,2016
2,Spain,La Liga,(ATL),Luis Suarez,34,1,2940,28,23.21,0.75,120,57,3.88,1.84,2016
3,Spain,La Liga,(CAR),Ruben Castro,32,3,2842,13,14.06,0.47,117,42,3.91,1.40,2016
4,Spain,La Liga,(VAL),Kevin Gameiro,21,10,1745,13,10.65,0.58,50,23,2.72,1.25,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
655,Netherlands,Eredivisie,(UTR),Gyrano Kerk,24,0,2155,10,7.49,0.33,50,18,2.20,0.79,2020
656,Netherlands,Eredivisie,(AJA),Quincy Promes,18,2,1573,12,9.77,0.59,56,30,3.38,1.81,2020
657,Netherlands,Eredivisie,(PSV),Denzel Dumfries,25,0,2363,7,5.72,0.23,45,14,1.81,0.56,2020
658,Netherlands,Eredivisie,,Cyriel Dessers,26,0,2461,15,14.51,0.56,84,43,3.24,1.66,2020


In [16]:
de['Year + 100'] = de['Year'] + 100

In [17]:
de

Unnamed: 0,Country,League,Club,Player Names,Matches_Played,Substitution,Mins,Goals,xG,xG Per Avg Match,Shots,OnTarget,Shots Per Avg Match,On Target Per Avg Match,Year,Year + 100
0,Spain,La Liga,(BET),Juanmi Callejon,19,16,1849,11,6.62,0.34,48,20,2.47,1.03,2016,2116
1,Spain,La Liga,(BAR),Antoine Griezmann,36,0,3129,16,11.86,0.36,88,41,2.67,1.24,2016,2116
2,Spain,La Liga,(ATL),Luis Suarez,34,1,2940,28,23.21,0.75,120,57,3.88,1.84,2016,2116
3,Spain,La Liga,(CAR),Ruben Castro,32,3,2842,13,14.06,0.47,117,42,3.91,1.40,2016,2116
4,Spain,La Liga,(VAL),Kevin Gameiro,21,10,1745,13,10.65,0.58,50,23,2.72,1.25,2016,2116
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
655,Netherlands,Eredivisie,(UTR),Gyrano Kerk,24,0,2155,10,7.49,0.33,50,18,2.20,0.79,2020,2120
656,Netherlands,Eredivisie,(AJA),Quincy Promes,18,2,1573,12,9.77,0.59,56,30,3.38,1.81,2020,2120
657,Netherlands,Eredivisie,(PSV),Denzel Dumfries,25,0,2363,7,5.72,0.23,45,14,1.81,0.56,2020,2120
658,Netherlands,Eredivisie,,Cyriel Dessers,26,0,2461,15,14.51,0.56,84,43,3.24,1.66,2020,2120
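
The reason for using copy() is that plain assignment only creates a second name for the same DataFrame. A minimal sketch with a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2016, 2020]})

alias = df                # same object under a second name
independent = df.copy()   # separate object with its own data

alias['Year + 100'] = alias['Year'] + 100

print(list(df.columns))           # the new column shows up in df
print(list(independent.columns))  # the copy is unaffected
```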


In [18]:
df['Player Names'].value_counts()

Andrea Belotti     5
Lionel Messi       5
Luis Suarez        5
Andrej Kramaric    5
Ciro Immobile      5
                  ..
Francois Kamano    1
Lebo Mothiba       1
Gaetan Laborde     1
Falcao             1
Cody Gakpo         1
Name: Player Names, Length: 444, dtype: int64

In [19]:
df['Player Names'].unique()

array(['Juanmi Callejon', 'Antoine Griezmann', 'Luis Suarez',
       'Ruben Castro', 'Kevin Gameiro', 'Cristiano Ronaldo',
       'Karim Benzema', 'Neymar ', 'Iago Aspas', 'Sergi Enrich',
       'Aduriz ', 'Sandro Ramlrez', 'Lionel Messi', 'Gerard Moreno',
       'Morata', 'Wissam Ben Yedder', 'Willian Jose', 'Andone ',
       'Cedric Bakambu', 'Isco', 'Mohamed Salah', 'Gregoire Defrel',
       'Ciro Immobile', 'Nikola Kalinic', 'Dries Mertens',
       'Alejandro Gomez', 'Jose CallejOn', 'Iago Falque',
       'Giovanni Simeone', 'Mauro Icardi', 'Diego Falcinelli',
       'Cyril Thereau', 'Edin Dzeko', 'Lorenzo Insigne',
       'Fabio Quagliarella', 'Borriello ', 'Carlos Bacca',
       'Gonzalo Higuain', 'Keita Balde', 'Andrea Belotti', 'Fin Bartels',
       'Lars Stindl', 'Serge Gnabry', 'Wagner ', 'Andrej Kramaric',
       'Florian Niederlechner', 'Robert Lewandowski', 'Emil Forsberg',
       'Timo Werner', 'Nils Petersen', 'Vedad Ibisevic', 'Mario Gomez',
       'Maximilian Philipp',
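
Some names in the unique() output, such as 'Neymar ', carry a trailing space, and that can split one player into two entries when counting. The sketch below uses a hypothetical list of names to show the effect and one way to deal with it by applying str.strip() before counting.

```python
import pandas as pd

# Hypothetical sample: 'Neymar ' and 'Neymar' differ only by a trailing space.
names = pd.Series(['Neymar ', 'Neymar', 'Isco', 'Lionel Messi', 'Lionel Messi'])

raw_unique = names.nunique()         # counts 'Neymar ' and 'Neymar' separately
clean = names.str.strip()            # remove leading/trailing whitespace
clean_counts = clean.value_counts()

print(raw_unique, clean.nunique())
print(clean_counts)
```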

#### f) drop function

In [23]:
df.drop(['xG'], axis = 1)

Unnamed: 0,Country,League,Club,Player Names,Matches_Played,Substitution,Mins,Goals,xG Per Avg Match,Shots,OnTarget,Shots Per Avg Match,On Target Per Avg Match,Year
0,Spain,La Liga,(BET),Juanmi Callejon,19,16,1849,11,0.34,48,20,2.47,1.03,2016
1,Spain,La Liga,(BAR),Antoine Griezmann,36,0,3129,16,0.36,88,41,2.67,1.24,2016
2,Spain,La Liga,(ATL),Luis Suarez,34,1,2940,28,0.75,120,57,3.88,1.84,2016
3,Spain,La Liga,(CAR),Ruben Castro,32,3,2842,13,0.47,117,42,3.91,1.40,2016
4,Spain,La Liga,(VAL),Kevin Gameiro,21,10,1745,13,0.58,50,23,2.72,1.25,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
655,Netherlands,Eredivisie,(UTR),Gyrano Kerk,24,0,2155,10,0.33,50,18,2.20,0.79,2020
656,Netherlands,Eredivisie,(AJA),Quincy Promes,18,2,1573,12,0.59,56,30,3.38,1.81,2020
657,Netherlands,Eredivisie,(PSV),Denzel Dumfries,25,0,2363,7,0.23,45,14,1.81,0.56,2020
658,Netherlands,Eredivisie,,Cyriel Dessers,26,0,2461,15,0.56,84,43,3.24,1.66,2020


### Report

In [2]:
pip install ydata-profiling


Note: you may need to restart the kernel to use updated packages.


In [4]:
# pandas-profiling has been renamed to ydata-profiling; the old package
# fails to import under pydantic 2 with a PydanticImportError.
from ydata_profiling import ProfileReport

In [5]:
report = ProfileReport(df)

In [None]:
report