# Pandas Vignette

March 30, 2022

Vignette: Pandas, reading CSV files, DataFrames, filtering, and wrangling data

@author: Oscar A. Trevizo

### References
1. "Pandas General Functions" (accessed Feb. 20, 2022) 
    https://pandas.pydata.org/pandas-docs/stable/reference/general_functions.html
    https://pandas.pydata.org/docs/reference/io.html
    https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
1. "Kaggle" (accessed Mar. 20, 2022)
    https://www.kaggle.com/
    https://www.kaggle.com/datasets/rishidamarla/fifa-players-ratings
1. UCI Machine Learning Repository 
    https://archive.ics.uci.edu/ml/datasets.php
1. Additional references:
    https://stackoverflow.com/tags/pandas/
    https://www.w3schools.com/python/pandas/default.asp
    https://www.geeksforgeeks.org/pandas-tutorial/



# Pandas


## Import libraries

In [1]:
import pandas as pd
import numpy as np

## Load data from a csv

In [2]:
# Source: https://www.kaggle.com/datasets/rishidamarla/fifa-players-ratings
# Load every column first to see what the file contains
df = pd.read_csv('fifa_cleaned.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17954 entries, 0 to 17953
Data columns (total 92 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   id                             17954 non-null  int64  
 1   name                           17954 non-null  object 
 2   full_name                      17954 non-null  object 
 3   birth_date                     17954 non-null  object 
 4   age                            17954 non-null  int64  
 5   height_cm                      17954 non-null  float64
 6   weight_kgs                     17954 non-null  float64
 7   positions                      17954 non-null  object 
 8   nationality                    17954 non-null  object 
 9   overall_rating                 17954 non-null  int64  
 10  potential                      17954 non-null  int64  
 11  value_euro                     17699 non-null  float64
 12  wage_euro                      17708 non-null 

In [4]:
# https://www.kaggle.com/datasets/rishidamarla/fifa-players-ratings
# And specify which columns to load
df = pd.read_csv('fifa_cleaned.csv', usecols = ['id', 'name', 'birth_date', 'age',
                                                'height_cm', 'weight_kgs', 'positions',
                                                'club_team', 'national_team', 'overall_rating',
                                                'wage_euro'])

In [5]:
df.head()

Unnamed: 0,id,name,birth_date,age,height_cm,weight_kgs,positions,overall_rating,wage_euro,club_team,national_team
0,158023,L. Messi,1987-06-24,31,170.18,72.1,"CF,RW,ST",94,565000.0,FC Barcelona,Argentina
1,190460,C. Eriksen,1992-02-14,27,154.94,76.2,"CAM,RM,CM",88,205000.0,Tottenham Hotspur,Denmark
2,195864,P. Pogba,1993-03-15,25,190.5,83.9,"CM,CAM",88,255000.0,Manchester United,France
3,198219,L. Insigne,1991-06-04,27,162.56,59.0,"LW,ST",88,165000.0,Napoli,Italy
4,201024,K. Koulibaly,1991-06-20,27,187.96,88.9,CB,88,135000.0,Napoli,
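
The `head()` output above lists the columns in the order they appear in the file, not in the order given to `usecols`. A minimal sketch of that behavior, assuming a small in-memory CSV (the rows and the `small` variable are made up for illustration and are not taken from `fifa_cleaned.csv`):

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real file
csv_text = """id,name,age,wage_euro,club_team
1,Player A,31,565000,Club X
2,Player B,27,,Club Y
3,Player C,25,255000,Club X
"""

# usecols names the columns to keep; the order of the list is ignored
small = pd.read_csv(io.StringIO(csv_text), usecols=['wage_euro', 'name', 'id'])

print(list(small.columns))   # columns come back in file order
print(small.dtypes)          # wage_euro is float64 because one value is missing
```

To get a specific column order, reindex after loading, for example `small[['wage_euro', 'name', 'id']]`.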


In [6]:
df.shape

(17954, 11)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17954 entries, 0 to 17953
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              17954 non-null  int64  
 1   name            17954 non-null  object 
 2   birth_date      17954 non-null  object 
 3   age             17954 non-null  int64  
 4   height_cm       17954 non-null  float64
 5   weight_kgs      17954 non-null  float64
 6   positions       17954 non-null  object 
 7   overall_rating  17954 non-null  int64  
 8   wage_euro       17708 non-null  float64
 9   club_team       17940 non-null  object 
 10  national_team   857 non-null    object 
dtypes: float64(3), int64(3), object(5)
memory usage: 1.5+ MB


In [8]:
# Summary statistics for the numeric columns
df.describe()

Unnamed: 0,id,age,height_cm,weight_kgs,overall_rating,wage_euro
count,17954.0,17954.0,17954.0,17954.0,17954.0,17708.0
mean,215411.08778,25.565445,174.946921,75.301047,66.240169,9902.134628
std,29758.387106,4.705708,14.029449,7.083684,6.96373,21995.59375
min,16.0,17.0,152.4,49.9,47.0,1000.0
25%,201117.25,22.0,154.94,69.9,62.0,1000.0
50%,222919.0,25.0,175.26,74.8,66.0,3000.0
75%,237613.5,29.0,185.42,79.8,71.0,9000.0
max,247607.0,46.0,205.74,110.2,94.0,565000.0


## Filters

In [9]:
# %% Filter basics
#
# A filter starts from a boolean test that returns True or False for each row
#
df['wage_euro'] >= 3000

0         True
1         True
2         True
3         True
4         True
         ...  
17949     True
17950    False
17951    False
17952     True
17953    False
Name: wage_euro, Length: 17954, dtype: bool
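
Note that `wage_euro` has missing values (17708 non-null out of 17954 rows). A comparison against `NaN` evaluates to `False`, so those rows drop out of any filter built from this mask. A small sketch with made-up wages (the `wages` Series is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical wages, one of them missing
wages = pd.Series([565000.0, 1000.0, np.nan, 3000.0], name='wage_euro')

mask = wages >= 3000              # NaN >= 3000 evaluates to False
print(mask.tolist())              # [True, False, False, True]
print(int(mask.sum()))            # number of rows that pass the test: 2
print(int(wages.isna().sum()))    # number of missing values: 1
```

Summing a boolean Series counts the `True` values, which is a quick way to check how many rows a filter will keep before applying it.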

In [10]:
# %% 1st approach - simple filter
# Index the DataFrame with the boolean Series to keep only the rows that are True
# Get players with wages at or above the 50th percentile (the median wage_euro is 3000)
df_wage_gt_3K = df[df['wage_euro'] >= 3000]
df_wage_gt_3K.shape

(10216, 11)
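
The threshold of 3000 is the 50% value reported by `describe()`. Rather than hard-coding it, the median can be computed from the column. A sketch under the assumption of a tiny made-up wage column (`wages`, `median_wage`, and `at_or_above` are illustrative names):

```python
import pandas as pd

# Hypothetical wages
wages = pd.Series([1000.0, 1000.0, 3000.0, 9000.0, 565000.0], name='wage_euro')

median_wage = wages.quantile(0.5)          # same value as wages.median()
at_or_above = wages[wages >= median_wage]  # rows at or above the median

print(median_wage)
print(len(at_or_above))
```

Because the test is `>=`, rows exactly at the median are kept, which is why more than half of the rows can pass.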

In [11]:
# Show top 5 using sort_values by wage and head
df_wage_gt_3K.sort_values(by=['wage_euro'], ascending=False).head()

Unnamed: 0,id,name,birth_date,age,height_cm,weight_kgs,positions,overall_rating,wage_euro,club_team,national_team
0,158023,L. Messi,1987-06-24,31,170.18,72.1,"CF,RW,ST",94,565000.0,FC Barcelona,Argentina
17938,176580,L. Suárez,1987-01-24,32,182.88,86.2,ST,91,455000.0,FC Barcelona,Uruguay
17939,177003,L. Modrić,1985-09-09,33,172.72,66.2,CM,91,420000.0,Real Madrid,
17944,20801,Cristiano Ronaldo,1985-02-05,34,187.96,83.0,"ST,LW",94,405000.0,Juventus,Portugal
17924,173731,G. Bale,1989-07-16,29,185.42,82.1,"RW,LW,ST",88,355000.0,Real Madrid,Wales


In [12]:
# %% Filter, select columns, and sort in one step
#
filter_boolean = df['wage_euro'] >= 3000
# Make a list of column headers
specific_columns = ['name', 'wage_euro', 'club_team', 'national_team']
# Select those columns, keep only the rows where the mask is True, then sort by wage
df_wage_50_less_columns = df[specific_columns][filter_boolean].sort_values(by=['wage_euro'], ascending=False)
df_wage_50_less_columns.head()

Unnamed: 0,name,wage_euro,club_team,national_team
0,L. Messi,565000.0,FC Barcelona,Argentina
17938,L. Suárez,455000.0,FC Barcelona,Uruguay
17939,L. Modrić,420000.0,Real Madrid,
17944,Cristiano Ronaldo,405000.0,Juventus,Portugal
17924,G. Bale,355000.0,Real Madrid,Wales
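
The same row and column selection can also be written with a single `.loc` call, which takes the row mask and the column list together. A minimal sketch, assuming a made-up `players` frame rather than the FIFA data:

```python
import pandas as pd

# Hypothetical players
players = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'wage_euro': [565000.0, 1000.0, 455000.0, 3000.0],
    'club_team': ['X', 'Y', 'X', 'Z'],
})

mask = players['wage_euro'] >= 3000

# .loc[rows, columns]: rows chosen by the mask, columns chosen by name
top = (players.loc[mask, ['name', 'wage_euro']]
       .sort_values(by='wage_euro', ascending=False))

print(top)
```

For reading data both forms return the same rows. `.loc` is the safer habit when the selection is later used for assignment, since chained indexing may then operate on a copy.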


In [13]:
# %% Filter on multiple conditions
#
# Apply two conditions: wages at or above the 50th percentile, and national team is Brazil.
# Each condition needs its own parentheses because & binds more tightly than >= and ==.
#
df_wage_50_brazil = df[(df['wage_euro'] >= 3000) & (df['national_team'] == 'Brazil')]
df_wage_50_brazil.shape

(23, 11)

In [14]:
# Show top 5 using sort_values by wage and head
df_wage_50_brazil.sort_values(by=['wage_euro'], ascending=False).head()

Unnamed: 0,id,name,birth_date,age,height_cm,weight_kgs,positions,overall_rating,wage_euro,club_team,national_team
17943,190871,Neymar Jr,1992-02-05,27,175.26,68.0,"LW,CAM",92,290000.0,Paris Saint-Germain,Brazil
17833,230294,Louri Beretta,1992-02-29,27,187.96,83.0,"ST,CF",83,60000.0,Atlético Mineiro,Brazil
17835,230481,Ronaldo Cabrais,1992-02-29,27,152.4,74.8,"RW,CAM",83,51000.0,Grêmio,Brazil
76,230258,Rosberto Dourado,1988-02-29,31,175.26,69.9,"CDM,CM",82,46000.0,Atlético Mineiro,Brazil
17834,230375,Josué Chiamulera,1992-02-29,27,185.42,79.8,CB,83,43000.0,Grêmio,Brazil
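
Two other ways to express combined conditions are `DataFrame.query`, which takes the test as a string, and `Series.isin`, which matches against a list of values. A sketch with a made-up `players` frame (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical players; one has no national team
players = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'wage_euro': [290000.0, 2000.0, 60000.0, 51000.0],
    'national_team': ['Brazil', 'Brazil', None, 'Chile'],
})

# Boolean masks combined with & (each condition in parentheses)
both = players[(players['wage_euro'] >= 3000) & (players['national_team'] == 'Brazil')]

# The same test written as a query string
same = players.query("wage_euro >= 3000 and national_team == 'Brazil'")

# Match any of several values in one column
either_team = players[players['national_team'].isin(['Brazil', 'Chile'])]

print(both)
print(either_team)
```

Use `|` for "or" and `~` for "not" when combining masks; the Python keywords `and`, `or`, and `not` do not work element-wise on a Series.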


In [15]:
# %% Filter, then get the unique values of one column
#
filter_boolean = df['wage_euro'] >= 3000
# Select the column, keep only the rows where the mask is True,
# and drop missing values with dropna()
df_top_natl_tm = df['national_team'][filter_boolean].dropna()

df_top_natl_tm = df_top_natl_tm.unique()    # returns a NumPy array, not a Series
type(df_top_natl_tm)

numpy.ndarray

In [16]:
# Display the NumPy array
df_top_natl_tm

array(['Argentina', 'Denmark', 'France', 'Italy', 'Netherlands',
       'Germany', 'Uruguay', 'Spain', 'Belgium', 'Egypt', 'Colombia',
       'Sporting CP', 'Portugal', 'Dalian YiFang FC', 'Mexico', 'Brazil',
       'England', 'Austria', 'Al Hilal', 'Iceland', 'Hungary', 'Wales',
       'Cameroon', "Côte d'Ivoire", 'Australia', 'FC Porto', 'Romania',
       'Chile', 'Norway', 'Venezuela', 'Sweden', 'Scotland', 'Canada',
       'SL Benfica', 'Poland', 'Turkey', 'Northern Ireland',
       'Santos Laguna', 'New Zealand', 'United States',
       'Republic of Ireland', 'Ecuador', 'Pachuca', 'Peru', 'Slovenia',
       'Racing Club', 'FC Red Bull Salzburg', 'Puebla FC',
       'Universidad de Chile', 'Club Tijuana', 'Paraguay', 'Querétaro',
       'South Africa', 'Cruz Azul', 'Switzerland', 'BSC Young Boys',
       'Atlético Nacional', 'Finland', 'Melbourne Victory',
       'Independiente Santa Fe', 'Melbourne City FC',
       'Club Atlético Talleres', 'Urawa Red Diamonds', 'Al Ahli',
       

In [17]:
# Convert the NumPy array to a pandas Series
df_top_natl_tm = pd.Series(df_top_natl_tm)
type(df_top_natl_tm)

pandas.core.series.Series

In [18]:
df_top_natl_tm.head()

0      Argentina
1        Denmark
2         France
3          Italy
4    Netherlands
dtype: object
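
If a Series is wanted from the start, `drop_duplicates()` keeps the result as a Series and avoids the round trip through an array. A minimal sketch, assuming a short made-up `teams` Series:

```python
import pandas as pd

# Hypothetical national teams with a repeat and a missing value
teams = pd.Series(['Argentina', 'Denmark', None, 'Argentina', 'France'],
                  name='national_team')

# unique() returns an array of distinct values in order of first appearance
as_array = teams.dropna().unique()

# drop_duplicates() stays a Series; reset_index renumbers it from 0
as_series = teams.dropna().drop_duplicates().reset_index(drop=True)

print(list(as_array))
print(as_series)
```

Both approaches keep the first occurrence of each value, so the order matches.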

## Create a DataFrame

In [19]:
# Get the data
first_name = [" Joan", "Mary ", " Vijay ", "Rob ", "Martha", "Josh", " Vicky", " Mario", "Jenny", "Joe"]
last_name = [" T"," K ", " N ", "R ", "L", "F ", " R", " L", "%^", "P"]
score_1 = [91, 83, 95, 72, 91, 85, 89, 82, 'abc', 79]
score_2 = [91, 85, 90, 81, 95, 92, 88, 94, 'xyz', 75]

# Build the dataframe
df = pd.DataFrame({'first_name':first_name, 'last_name':last_name,'score_1':score_1,'score_2':score_2}  )
df.head(10)

Unnamed: 0,first_name,last_name,score_1,score_2
0,Joan,T,91,91
1,Mary,K,83,85
2,Vijay,N,95,90
3,Rob,R,72,81
4,Martha,L,91,95
5,Josh,F,85,92
6,Vicky,R,89,88
7,Mario,L,82,94
8,Jenny,%^,abc,xyz
9,Joe,P,79,75


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   first_name  10 non-null     object
 1   last_name   10 non-null     object
 2   score_1     10 non-null     object
 3   score_2     10 non-null     object
dtypes: object(4)
memory usage: 448.0+ bytes


## To numeric

In [21]:
df['score_1'] = pd.to_numeric(df['score_1'], errors='coerce')

In [22]:
df['score_2'] = pd.to_numeric(df['score_2'], errors='coerce')

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   first_name  10 non-null     object 
 1   last_name   10 non-null     object 
 2   score_1     9 non-null      float64
 3   score_2     9 non-null      float64
dtypes: float64(2), object(2)
memory usage: 448.0+ bytes
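
With `errors='coerce'`, values that cannot be parsed become `NaN` silently, which is why the non-null count dropped from 10 to 9. To see which original values were lost, compare the converted column with the raw one. A sketch using a made-up `raw` Series:

```python
import pandas as pd

# Hypothetical raw scores with one bad entry
raw = pd.Series([91, 83, 'abc', 79], name='score_1')

converted = pd.to_numeric(raw, errors='coerce')   # 'abc' becomes NaN

# Values that were present before conversion but are NaN afterwards
failed = raw[converted.isna() & raw.notna()]

print(converted.dtype)    # float64, because NaN is a float
print(failed.tolist())    # ['abc']
```

Checking `failed` before dropping rows helps catch entries that could be repaired instead of discarded.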


In [24]:
df.describe()

Unnamed: 0,score_1,score_2
count,9.0,9.0
mean,85.222222,87.888889
std,7.120003,6.527719
min,72.0,75.0
25%,82.0,85.0
50%,85.0,90.0
75%,91.0,92.0
max,95.0,95.0


## Clean up blank spaces

In [25]:
# Strip right and left spaces
df["first_name"] = df["first_name"].str.strip()
df["last_name"] = df["last_name"].str.strip()
df.head(10)

Unnamed: 0,first_name,last_name,score_1,score_2
0,Joan,T,91.0,91.0
1,Mary,K,83.0,85.0
2,Vijay,N,95.0,90.0
3,Rob,R,72.0,81.0
4,Martha,L,91.0,95.0
5,Josh,F,85.0,92.0
6,Vicky,R,89.0,88.0
7,Mario,L,82.0,94.0
8,Jenny,%^,,
9,Joe,P,79.0,75.0
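
`str.strip()` removes whitespace only at the two ends of each string. Repeated spaces inside a value need a separate step. A small sketch, assuming made-up names (the `'Mary  Ann'` entry is added here for illustration):

```python
import pandas as pd

# Hypothetical names with leading, trailing, and interior spaces
names = pd.Series([' Joan', 'Mary ', ' Vijay ', 'Mary  Ann'])

stripped = names.str.strip()                                # ends only
collapsed = stripped.str.replace(r'\s+', ' ', regex=True)   # interior runs -> one space

print(stripped.tolist())
print(collapsed.tolist())
```

`str.lstrip()` and `str.rstrip()` are available when only one side should be trimmed.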


## Regex clean up characters
See also the regex vignette notebook, regex_vignette.ipynb.

In [26]:
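# A last name with any non-letter character becomes NaN (the np.NaN alias was removed in NumPy 2.0)
df["last_name"] = df["last_name"].replace('[^A-Za-z]', np.nan, regex=True)
# The score columns are already float64 after pd.to_numeric, so these two calls change nothing
df["score_1"] = df["score_1"].replace('[^0-9]', np.nan, regex=True)
df["score_2"] = df["score_2"].replace('[^0-9]', np.nan, regex=True)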

df.head(10)

Unnamed: 0,first_name,last_name,score_1,score_2
0,Joan,T,91.0,91.0
1,Mary,K,83.0,85.0
2,Vijay,N,95.0,90.0
3,Rob,R,72.0,81.0
4,Martha,L,91.0,95.0
5,Josh,F,85.0,92.0
6,Vicky,R,89.0,88.0
7,Mario,L,82.0,94.0
8,Jenny,,,
9,Joe,P,79.0,75.0
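
An alternative way to express the same clean-up is to test each value against a pattern of allowed characters and keep only the values that match. This is a sketch of that variant, not the approach used above; the `last_name` values, including `O'Neil`, are made up:

```python
import pandas as pd

# Hypothetical last names; two contain characters other than letters
last_name = pd.Series(['T', 'K', '%^', "O'Neil"])

# True only when the whole string consists of letters
is_alpha = last_name.str.fullmatch(r'[A-Za-z]+')

# where() keeps values where the mask is True and puts NaN elsewhere
cleaned = last_name.where(is_alpha)

print(is_alpha.tolist())
print(cleaned.tolist())
```

A strict letters-only rule also rejects legitimate names with apostrophes, hyphens, or accented characters, so the pattern should be widened for real data.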


## Drop NaN

In [27]:
df.dropna(inplace=True)
df.head(10)

Unnamed: 0,first_name,last_name,score_1,score_2
0,Joan,T,91.0,91.0
1,Mary,K,83.0,85.0
2,Vijay,N,95.0,90.0
3,Rob,R,72.0,81.0
4,Martha,L,91.0,95.0
5,Josh,F,85.0,92.0
6,Vicky,R,89.0,88.0
7,Mario,L,82.0,94.0
9,Joe,P,79.0,75.0
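
After `dropna()` the index keeps its original labels, so label 8 is missing in the output above. The sketch below, on a made-up `scores` frame, shows two common refinements: limiting the check to certain columns with `subset`, and renumbering the index afterwards:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one incomplete row
scores = pd.DataFrame({
    'first_name': ['Joan', 'Jenny', 'Joe'],
    'last_name': ['T', np.nan, 'P'],
    'score_1': [91.0, np.nan, 79.0],
})

# Drop rows only when score_1 is missing, then renumber the index from 0
clean = scores.dropna(subset=['score_1']).reset_index(drop=True)

print(clean)
```

Assigning the result to a new name, as here, leaves the original frame untouched, unlike `inplace=True`.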


## Assign data types

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 9
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   first_name  9 non-null      object 
 1   last_name   9 non-null      object 
 2   score_1     9 non-null      float64
 3   score_2     9 non-null      float64
dtypes: float64(2), object(2)
memory usage: 360.0+ bytes


In [29]:
# Based on example from https://www.geeksforgeeks.org/change-data-type-for-one-or-more-columns-in-pandas-dataframe
# Similar reference in https://stackoverflow.com/questions/49684951/pandas-read-csv-dtype-read-all-columns-but-few-as-string
dtypes_dict = {'first_name' : str,
               'last_name' : str,
               'score_1' : int,
               'score_2' : int}
df = df.astype(dtypes_dict)

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 9
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   first_name  9 non-null      object
 1   last_name   9 non-null      object
 2   score_1     9 non-null      int32 
 3   score_2     9 non-null      int32 
dtypes: int32(2), object(2)
memory usage: 288.0+ bytes
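
The cast to `int` works here only because the rows with `NaN` were dropped first, and the width it produces (`int32` in the output above) depends on the platform and NumPy version. A sketch of two alternatives, assuming a made-up `score` Series: an explicit `'int64'`, and the nullable `'Int64'` dtype, which can hold missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
score = pd.Series([91.0, np.nan, 79.0], name='score_1')

# A plain int cast cannot represent NaN and raises an error
try:
    score.astype(int)
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# Nullable integer dtype (capital I) keeps the missing value as <NA>
nullable = score.astype('Int64')

# Explicit width after dropping the missing value
explicit = score.dropna().astype('int64')

print(plain_int_failed)
print(nullable.dtype, explicit.dtype)
```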


## Add column / List comprehension

In [None]:
df['s1_a'] = ['a' if col >= 90 else ' ' for col in df['score_1']]

In [None]:
df.head(10)

In [None]:
df['s2_a'] = ['a' if col >= 90 else ' ' for col in df['score_2']]

In [None]:
df.head(10)

In [None]:
df['improved'] = ['yes' if col2 > col1 else 'no' for col1, col2 in zip(df['score_1'], df['score_2'])]

In [None]:
df.head(10)
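
The list comprehensions above loop over the values in Python. `numpy.where` builds the same kind of column from a vectorized condition. A minimal sketch on a made-up `df_scores` frame (the name is chosen here so it does not overwrite `df`):

```python
import numpy as np
import pandas as pd

# Hypothetical scores
df_scores = pd.DataFrame({'score_1': [91, 83, 95], 'score_2': [91, 85, 90]})

# np.where(condition, value_if_true, value_if_false), evaluated element-wise
df_scores['s1_a'] = np.where(df_scores['score_1'] >= 90, 'a', ' ')
df_scores['improved'] = np.where(df_scores['score_2'] > df_scores['score_1'], 'yes', 'no')

print(df_scores)
```

On a frame this small the difference is negligible; the vectorized form mainly pays off on large columns.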

## Additional filter examples

In [None]:
# The filter itself
df["score_1"] == df.score_1.max()

In [None]:
# Apply the filter and keep only certain columns
df[['first_name', 'last_name','score_1']][df["score_1"] == df.score_1.max()]

In [None]:
# Another filter for practice: both scores above 90
df[['first_name', 'last_name','score_1', 'score_2']][(df["score_1"] > 90) & (df['score_2'] > 90) ]
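
Comparing against `max()` returns every row tied for the top score. When only one row is needed, `idxmax()` gives the index label of the first maximum, and `nlargest` with `keep='all'` is another way to get all ties. A sketch on a made-up `scores` frame with a deliberate tie:

```python
import pandas as pd

# Hypothetical scores; Joan and Vijay tie for the top score_1
scores = pd.DataFrame({
    'first_name': ['Joan', 'Vijay', 'Martha'],
    'score_1': [95, 95, 91],
})

# Mask against the maximum: all tied rows
all_top = scores[scores['score_1'] == scores['score_1'].max()]

# idxmax: label of the first row holding the maximum
first_label = scores['score_1'].idxmax()

# nlargest with keep='all' also returns every row tied for first place
top_with_ties = scores.nlargest(1, 'score_1', keep='all')

print(all_top['first_name'].tolist())
print(first_label)
print(len(top_with_ties))
```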