# Master's in Applied Artificial Intelligence
## Machine Learning Algorithms Course

Notebooks for MLA course

by [*lufer*](mailto:lufer@ipca.pt)

---



# Datasets on ML Modelling - Part II

**Contents**:

1.   Python essentials
2.   Working with Datasets
3.   **Features Manipulation**
4.   **Cleaning Data**
5.  Data Visualization


## Environment preparation


### Importing necessary Libraries

In [1]:
import pandas as pd
import numpy as np

Mounting Drive

In [None]:

from google.colab import drive

# it will ask for your Google Drive credentials
drive.mount('/content/gDrive/', force_remount=True)

Mounted at /content/gDrive/


*Loading dataset*

In [2]:
import requests

download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = "nbaAll.csv"

#create a local file with remote csv data
response = requests.get(download_url)
response.raise_for_status()
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")

nba = pd.read_csv("nbaAll.csv")

Download ready.


In [None]:
nba


## 3 - Features Manipulation

In [None]:
#checking dataset structure
nba.shape

In [None]:
nba.head()

*Filtering rows with "isin"*

In [None]:
nbaYear = nba[nba["year_id"].isin([1948, 1949])]
nbaYear
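
The mask produced by `isin` can also be inverted to keep the rows that are *not* in the list. A minimal sketch, assuming a small made-up frame (`games` and its values are illustrative and not taken from the NBA file):

```python
import pandas as pd

# Hypothetical toy data, only to illustrate isin(); not taken from nbaallelo.csv
games = pd.DataFrame({"year_id": [1947, 1948, 1949, 1950],
                      "pts": [66, 68, 63, 47]})

mask = games["year_id"].isin([1948, 1949])   # boolean Series, one entry per row
selected = games[mask]                       # rows whose year is in the list
others = games[~mask]                        # "~" inverts the mask
print(selected)
print(others)
```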

*Get first N columns from a dataframe*

In [None]:
n=3
aux = nba.iloc[:,:n]
aux

*Get last N columns from a dataframe*

In [None]:
aux = nba.iloc[:,-3:]
aux

### Deriving new Feature

*Create new Feature (column)*

In [11]:
nba["date_played"] = pd.to_datetime(nba["date_game"])
nba
nba.columns

Index(['gameorder', 'game_id', 'lg_id', '_iscopy', 'year_id', 'date_game',
       'seasongame', 'is_playoffs', 'team_id', 'fran_id', 'pts', 'elo_i',
       'elo_n', 'win_equiv', 'opp_id', 'opp_fran', 'opp_pts', 'opp_elo_i',
       'opp_elo_n', 'game_location', 'game_result', 'forecast', 'notes',
       'date_played'],
      dtype='object')

*Create a new Feature computed from other Features*

In [None]:
#See https://www.plus2net.com/python/pandas-dt-timedelta64.php
from datetime import date
today = pd.to_datetime(date.today())
#elapsed time as a timedelta, then converted to a number of days
nba['C'] = today - nba['date_played']
nba['DaysCPassed'] = nba['C'] / np.timedelta64(1, 'D')
nba.shape

(126314, 26)

In [None]:
nba.DaysCPassed.max()

28206.0
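
As an alternative to dividing by `np.timedelta64(1, 'D')`, the `.dt.days` accessor returns whole days as integers. The sketch below assumes a made-up two-row frame and a fixed reference date (both hypothetical) so that the result is reproducible:

```python
import pandas as pd

# Hypothetical toy data; the fixed reference date keeps the result reproducible
toy = pd.DataFrame({"date_game": ["11/1/1946", "11/2/1946"]})
toy["date_played"] = pd.to_datetime(toy["date_game"], format="%m/%d/%Y")

reference = pd.Timestamp("1946-11-11")
elapsed = reference - toy["date_played"]   # Series of timedeltas
toy["days_passed"] = elapsed.dt.days       # whole days as integers
print(toy)
```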

### Changing Feature names

In [None]:
renamedNba = nba.rename(columns={"DaysCPassed": "DaysPassed"})

In [None]:
renamedNba.info()
print('-'*50)
nba.info()

### Deleting Features

*Delete a particular Feature (column)*

In [None]:
df = nba.drop(columns=['C'])
df.shape
df

In [None]:
#renamedNba.drop(['C'], inplace=True, axis=1)
renamedNba.info()
print('-'*50)
renamedNba.shape

### Changing the Data Type of Columns

In [None]:
df = nba.copy()
df.info()
df

*Convert column types*

In [None]:
nba["date_played"] = pd.to_datetime(nba["date_game"])

*Identify unique values*

In [None]:
a=df["game_location"].unique()
print(a)

['H' 'A' 'N']


*Counting distinct values*

In [None]:
a=df["game_location"].nunique()
a

3

*Occurrences*

In [None]:
df['team_id'].value_counts()

BOS    5997
NYK    5769
LAL    5078
DET    4985
PHI    4533
       ... 
INJ      60
PIT      60
DTF      60
TRH      60
SDS      11
Name: team_id, Length: 104, dtype: int64

*Make columns Category type*

In [None]:
t= pd.Categorical(nba['team_id'] )

In [None]:
t

['TRH', 'NYK', 'CHS', 'NYK', 'DTF', ..., 'CLE', 'GSW', 'CLE', 'CLE', 'GSW']
Length: 126314
Categories (104, object): ['ANA', 'AND', 'ATL', 'BAL', ..., 'WAT', 'WSA', 'WSB', 'WSC']

In [None]:
df.info()

In [None]:
df["game_location"] = pd.Categorical(df["game_location"])
df["game_location"].dtype

In [None]:
df.info()
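
One reason to convert a column with few distinct values to a category is that it is stored as small integer codes plus a list of labels. A minimal sketch, assuming a made-up column (`locations` is illustrative, not the NBA data):

```python
import pandas as pd

# Hypothetical toy column with few distinct values repeated many times
locations = pd.Series(["H", "A", "N", "H", "A"] * 1000)
as_cat = locations.astype("category")

# Memory in bytes before and after the conversion
print(locations.memory_usage(deep=True), as_cat.memory_usage(deep=True))
print(as_cat.cat.categories.tolist())     # distinct labels, sorted
print(as_cat.cat.codes.head().tolist())   # integer code stored for each row
```

The exact byte counts depend on the pandas version, so they are printed rather than stated here.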

### Aggregations for `DataFrame`


In [None]:
points = nba["pts"]
type(points)
# Expected:
# <class 'pandas.core.series.Series'>

pandas.core.series.Series

In [None]:
points.sum()
# Expected:
# 12976235

12976235

### Grouping

In [None]:
nba.groupby("fran_id", sort=False)["pts"].sum()
# Expected:
# fran_id
# Huskies           3995
# Knicks          582497
# Stags            20398
# Falcons           3797
# Capitols         22387

## 4 - Cleaning Data

Cleaning, normalizing, and initializing values are some of the tasks required for any dataset.

In [None]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 26 columns):
 #   Column         Non-Null Count   Dtype          
---  ------         --------------   -----          
 0   gameorder      126314 non-null  int64          
 1   game_id        126314 non-null  object         
 2   lg_id          126314 non-null  object         
 3   _iscopy        126314 non-null  int64          
 4   year_id        126314 non-null  int64          
 5   date_game      126314 non-null  object         
 6   seasongame     126314 non-null  int64          
 7   is_playoffs    126314 non-null  int64          
 8   team_id        126314 non-null  object         
 9   fran_id        126314 non-null  object         
 10  pts            126314 non-null  int64          
 11  elo_i          126314 non-null  float64        
 12  elo_n          126314 non-null  float64        
 13  win_equiv      126314 non-null  float64        
 14  opp_id         126314 non-null  obje

### Missing Values

Avoid *null values*

The current nba dataset has null values (*Null/None/NaN values*).

The column "*notes*" has only 5424 *non-null* values. All remaining columns have 126314 values.

Let's analyze the following example:

In [15]:
#import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 31, 22, None, 27],
        'Gender': ['F', 'M', None, 'M', 'F'],
        'Salary': [50000, None, 30000, 40000, 60000]}

df = pd.DataFrame(data)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    5 non-null      object 
 1   Age     4 non-null      float64
 2   Gender  4 non-null      object 
 3   Salary  4 non-null      float64
dtypes: float64(2), object(2)
memory usage: 288.0+ bytes


It is easy to see that *Name* has 5 *non-null* values, while each of the other columns has only 4.

In [16]:
#preserve original dataset
dfCopy = df.copy()
dfCopy

Unnamed: 0,Name,Age,Gender,Salary
0,Alice,25.0,F,50000.0
1,Bob,31.0,M,
2,Charlie,22.0,,30000.0
3,David,,M,40000.0
4,Eva,27.0,F,60000.0


*Identify Missing Values*

When the DataFrame is created, pandas marks missing entries by default (for example, `None` in a numeric column becomes `NaN`). The functions to identify these missing values are:

*   **isnull()**
*   **notnull()**


Applied to a DataFrame, each function returns a DataFrame of boolean values with the same shape, indicating for every cell whether its value is missing.

For isnull(), "True" means the value is missing while "False" means it is not; notnull() returns the opposite.

In [17]:
missing_data = dfCopy.isnull()
missing_data.head(5)

Unnamed: 0,Name,Age,Gender,Salary
0,False,False,False,False
1,False,False,False,True
2,False,False,True,False
3,False,True,False,False
4,False,False,False,False
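
Because `True` counts as 1, the boolean mask can be summed to count the missing values per column. A minimal sketch that rebuilds the same example data under the hypothetical name `people`:

```python
import pandas as pd

# Same values as the example above, rebuilt so the block is self-contained
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
                       "Age": [25, 31, 22, None, 27],
                       "Gender": ["F", "M", None, "M", "F"],
                       "Salary": [50000, None, 30000, 40000, 60000]})

missing_per_column = people.isnull().sum()   # number of missing values per column
missing_share = people.isnull().mean()       # fraction of missing values per column
print(missing_per_column)
print(missing_share)
```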


### Replacing missing values

When *inplace=True* is passed, the data is modified in place (the call returns nothing), so you'd use:

*df.an_operation(inplace=True)*

When *inplace=False* is passed (this is the default, so it isn't necessary to specify it), the operation is performed on a copy of the object, which is returned, so you'd use:

*df = df.an_operation(inplace=False)*
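
The difference between the two modes can be checked by counting the missing values before and after each call. A minimal sketch, assuming a made-up single-column frame (`df_demo` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with one missing value
df_demo = pd.DataFrame({"Salary": [50000, None, 30000]})

# Default (inplace=False): a filled copy is returned, df_demo is untouched
copy_filled = df_demo.fillna({"Salary": 0})
n_before = int(df_demo["Salary"].isnull().sum())

# inplace=True: df_demo itself is modified
df_demo.fillna({"Salary": 0}, inplace=True)
n_after = int(df_demo["Salary"].isnull().sum())

print(n_before, n_after)
```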

Replace null values of *Age* feature by *Unknown*

In [None]:
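#assign the result back: fillna(inplace=True) on a selected column
#may not update dfCopy in recent pandas versions
dfCopy["Age"] = dfCopy["Age"].fillna("Unknown")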

In [None]:
dfCopy

*Replace null values by a particular value*

In [None]:
dfCopy.fillna({'Age':'Unknown', 'Gender': 'Other'}, inplace=True)

In [None]:
dfCopy

In [None]:
display(dfCopy)

Replace the "?" symbol with *NaN* so that dropna() can remove the missing values:

In [None]:
df1=dfCopy.replace('?',np.nan)
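
The example data of this notebook contains no "?" markers, so the following sketch uses a made-up frame (`raw` is hypothetical) to show the replace-then-drop sequence end to end:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data in which "?" marks a missing value
raw = pd.DataFrame({"Age": ["25", "?", "22"],
                    "Gender": ["F", "M", "?"]})

cleaned = raw.replace("?", np.nan)   # "?" becomes a real missing value
complete = cleaned.dropna()          # keep only rows without missing values
print(cleaned.isnull().sum())
print(complete)
```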

Fill numeric features with the *mean* value

In [None]:
#reset dfCopy
dfCopy = df.copy()
#dfCopy
#dfCopy.info()

In [None]:
#Using mean() to impute the values with fillna (returns a new DataFrame; dfCopy is unchanged)
dfCopy.fillna({'Salary':dfCopy['Salary'].mean()})

In [None]:
dfCopy

Fill number features with the *mode* value

In [None]:
#Using mode() to impute the values with fillna
dfCopy.fillna({'Salary':dfCopy['Salary'].mode()[0]}, inplace = True)

In [None]:
dfCopy

In [None]:
dfCopy.fillna({'Age':-1, 'Gender':'Other'}, inplace = True)

In [None]:
dfCopy
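
Different columns can receive different fill values in a single `fillna` call. The sketch below rebuilds the example data under the hypothetical name `people`; the median for *Age* is an alternative that the cells above do not use and is included only for comparison:

```python
import pandas as pd

# Same values as the example above, rebuilt so the block is self-contained
people = pd.DataFrame({"Age": [25, 31, 22, None, 27],
                       "Gender": ["F", "M", None, "M", "F"],
                       "Salary": [50000, None, 30000, 40000, 60000]})

fill_values = {"Age": people["Age"].median(),          # 26.0
               "Salary": people["Salary"].mean(),      # 45000.0
               "Gender": people["Gender"].mode()[0]}   # first of the most frequent values
imputed = people.fillna(fill_values)                   # people itself is unchanged
print(imputed)
```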

### See the *null* values

In [None]:
#reset dfCopy so the missing values are present again
dfCopy = df.copy()
n1 = dfCopy.isnull().any(axis=1)
n1

0    False
1     True
2     True
3     True
4    False
dtype: bool

### Get only the rows with *null* values

In [None]:
nullRows = dfCopy[n1]
nullRows

Unnamed: 0,Name,Age,Gender,Salary
1,Bob,31.0,M,
2,Charlie,22.0,,30000.0
3,David,,M,40000.0


### Get only the rows without *null* values

In [None]:
n2 = dfCopy.notnull().all(axis=1)
n2

0     True
1    False
2    False
3    False
4     True
dtype: bool

In [None]:
nonNullRows = dfCopy[n2]
nonNullRows

Unnamed: 0,Name,Age,Gender,Salary
0,Alice,25.0,F,50000.0
4,Eva,27.0,F,60000.0


### Checking *Null Values* using Query Method

In this example, the != operator compares each column with itself; the result is *True* only where the value is *null*, because a missing value is never equal to itself.

In [None]:
nullRows = dfCopy.query('Age != Age or Gender != Gender or Salary != Salary')

In [None]:
nullRows

Unnamed: 0,Name,Age,Gender,Salary
1,Bob,31.0,M,
2,Charlie,22.0,,30000.0
3,David,,M,40000.0
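
The self-comparison trick relies on `NaN` being unequal to itself. A minimal sketch, assuming a made-up one-column frame (`ages` is hypothetical), that compares it with the explicit `isnull()` filter:

```python
import numpy as np
import pandas as pd

print(np.nan != np.nan)   # True: NaN never equals itself

# Hypothetical toy frame with one missing value
ages = pd.DataFrame({"Age": [25.0, np.nan, 22.0]})

by_query = ages.query("Age != Age")      # self-comparison inside query()
by_isnull = ages[ages["Age"].isnull()]   # explicit equivalent
print(by_query.equals(by_isnull))
```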


###  Remove rows with missing values

The easiest way to deal with records containing missing values (incomplete records) is to ignore them!


In [None]:
dfCopy.shape
#dfCopy

(5, 4)

In [None]:
#default axis=0 (index==rows)
rowsWithoutMissingData = dfCopy.dropna()

In [None]:
rowsWithoutMissingData.shape

(2, 4)

In [None]:
rowsWithoutMissingData

Unnamed: 0,Name,Age,Gender,Salary
0,Alice,25.0,F,50000.0
4,Eva,27.0,F,60000.0
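
`dropna()` can also be restricted to some columns or made less strict. A minimal sketch using the `subset` and `thresh` parameters on a made-up frame (`people` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a different number of gaps per row
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"],
                       "Age": [25, 31, None],
                       "Gender": ["F", None, "M"],
                       "Salary": [50000, None, 30000]})

has_age = people.dropna(subset=["Age"])      # only gaps in Age remove a row
mostly_full = people.dropna(thresh=3)        # keep rows with at least 3 non-null values
print(has_age)
print(mostly_full)
```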


### Remove *Features* with null-values

Remove problematic columns if they’re not relevant for your analysis.

In [None]:
#Features==Columns (axis 1)
dataWithoutMissingColumns = dfCopy.dropna(axis=1)

In [None]:
dataWithoutMissingColumns

Unnamed: 0,Name
0,Alice
1,Bob
2,Charlie
3,David
4,Eva


In [None]:
nba.info()

### Change *Null Values*

In [None]:
#see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
data_with_default_notes = nba.copy()
#assign the result back instead of calling fillna(inplace=True) on a selected column
data_with_default_notes["notes"] = data_with_default_notes["notes"].fillna("no notes at all")
data_with_default_notes["notes"].describe()
# Expected:
# count              126314
# unique                232
# top       no notes at all
# freq               120890
# Name: notes, dtype: object

count              126314
unique                232
top       no notes at all
freq               120890
Name: notes, dtype: object

### Invalid Values

In [None]:
nba[nba["pts"] == 0]

Unnamed: 0,gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,team_id,fran_id,...,opp_pts,opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes,date_played,C,DaysCPassed
26684,13343,197210260VIR,ABA,1,1973,10/26/1972,7,0,DNR,Nuggets,...,2,1484.1907,1487.083,A,L,0.328948,at Richmond VA; forfeit to VIR,1972-10-26,18715 days,18715.0


### Inconsistencies Between Values in Different Columns

In [None]:
nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != "W")].empty
# Expected:
# True

True

In [None]:
nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != "L")].empty
# Expected:
# True

True
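
The two checks can be combined by deriving the expected result from the scores and listing the rows that disagree. A minimal sketch on made-up data (`toy` is hypothetical and deliberately contains one inconsistent row); it assumes games cannot end in a tie:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row is inconsistent on purpose
toy = pd.DataFrame({"pts": [68, 66, 70],
                    "opp_pts": [66, 68, 60],
                    "game_result": ["W", "L", "L"]})

# Result implied by the scores (assumes no ties)
expected = np.where(toy["pts"] > toy["opp_pts"], "W", "L")
inconsistent = toy[toy["game_result"] != expected]
print(inconsistent)
```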

### Splitting Datasets



---



***Slicing Data (essential)***

In [None]:
a = [7, 2, 3, 7, 5, 6, 0, 1]

The instruction *[start:stop]* includes the value at position *start*, but not the one at position *stop*. Both are optional.

Part of the data

In [None]:
a[1:5]

In [None]:
a[:5]

In [None]:
a[3:]

Start from the end of the data

In [None]:
a[-4:]

In [None]:
a[-6:-2]

*Value substitution*

In [None]:
a[3:4] = [6, 3]
a

*Change the step using [start:stop:step]*

In [None]:
a[::2]

*Special case: Inverting a sequence*

In [None]:
a[::-1]



---



**Considering Datasets**

*Splitting the Dataset by Row*

In [None]:
len(nba)

126314

*Splitting  Dataframe by groups*

Group the data by column value *year_id*. The newly formed dataframe consists of grouped data with *year_id* = 1947.

In [4]:
nba.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   gameorder      126314 non-null  int64  
 1   game_id        126314 non-null  object 
 2   lg_id          126314 non-null  object 
 3   _iscopy        126314 non-null  int64  
 4   year_id        126314 non-null  int64  
 5   date_game      126314 non-null  object 
 6   seasongame     126314 non-null  int64  
 7   is_playoffs    126314 non-null  int64  
 8   team_id        126314 non-null  object 
 9   fran_id        126314 non-null  object 
 10  pts            126314 non-null  int64  
 11  elo_i          126314 non-null  float64
 12  elo_n          126314 non-null  float64
 13  win_equiv      126314 non-null  float64
 14  opp_id         126314 non-null  object 
 15  opp_fran       126314 non-null  object 
 16  opp_pts        126314 non-null  int64  
 17  opp_elo_i      126314 non-nul

In [5]:
# splitting dataframe by groups
# grouping by year
grouped = nba.groupby(nba.year_id)
#get the group of 1947
df_new=grouped.get_group(1947)
df_new

Unnamed: 0,gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,team_id,fran_id,...,win_equiv,opp_id,opp_fran,opp_pts,opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes
0,1,194611010TRH,NBA,0,1947,11/1/1946,1,0,TRH,Huskies,...,40.294830,NYK,Knicks,68,1300.0000,1306.7233,H,L,0.640065,
1,1,194611010TRH,NBA,1,1947,11/1/1946,1,0,NYK,Knicks,...,41.705170,TRH,Huskies,66,1300.0000,1293.2767,A,W,0.359935,
2,2,194611020CHS,NBA,0,1947,11/2/1946,1,0,CHS,Stags,...,42.012257,NYK,Knicks,47,1306.7233,1297.0712,H,W,0.631101,
3,2,194611020CHS,NBA,1,1947,11/2/1946,2,0,NYK,Knicks,...,40.692783,CHS,Stags,63,1300.0000,1309.6521,A,L,0.368899,
4,3,194611020DTF,NBA,0,1947,11/2/1946,1,0,DTF,Falcons,...,38.864048,WSC,Capitols,50,1300.0000,1320.3811,H,L,0.640065,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,348,194704190CHS,NBA,1,1947,4/19/1947,68,1,PHW,Warriors,...,55.649139,CHS,Stags,72,1416.6769,1409.4009,A,W,0.389797,
696,349,194704200CHS,NBA,1,1947,4/20/1947,69,1,PHW,Warriors,...,55.359722,CHS,Stags,74,1409.4009,1412.5547,A,L,0.409896,
697,349,194704200CHS,NBA,0,1947,4/20/1947,71,1,CHS,Stags,...,52.490616,PHW,Warriors,73,1446.0986,1442.9448,H,W,0.590104,
698,350,194704220GSW,NBA,1,1947,4/22/1947,72,1,CHS,Stags,...,52.176041,PHW,Warriors,83,1442.9448,1446.1919,A,L,0.320694,


In [None]:
df_new.count()

*Splitting Pandas Dataframe by sized chunks*

Random 60%

In [6]:
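# splitting dataframe in a particular size
df_split = nba.sample(frac=.6)
#reset_index returns a new DataFrame, so assign it back (drop=True discards the old row labels)
df_split = df_split.reset_index(drop=True)
#df_split
len(df_split)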

75788
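
A random fraction and its complement together give two disjoint parts, for example a training and a test set. A minimal sketch, assuming a made-up ten-row frame (`data`, `train` and `test` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data with ten rows
data = pd.DataFrame({"x": range(10)})

train = data.sample(frac=0.6, random_state=42)   # random 60% of the rows
test = data.drop(train.index)                    # the remaining 40%
print(len(train), len(test))
```

The complement is taken before any `reset_index` call, because `drop` relies on the original row labels.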

Split  dataframe in different sets

In [7]:
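#sample(frac=1) returns a shuffled copy of the whole dataset;
#ds3 itself keeps its original order unless the result is assigned back
ds3 = nba.copy()
ds3.sample(frac=1, random_state=42)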

Unnamed: 0,gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,team_id,fran_id,...,win_equiv,opp_id,opp_fran,opp_pts,opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes
37090,18546,197702020NYN,NBA,0,1977,2/2/1977,48,0,NYN,Nets,...,26.130209,BOS,Celtics,89,1454.9109,1444.0807,H,W,0.483096,
7984,3993,195712030GSW,NBA,0,1958,12/3/1957,17,0,PHW,Warriors,...,39.511757,SYR,Sixers,119,1532.4493,1557.2676,H,L,0.634248,at New York NY
74114,37058,199412030SEA,NBA,0,1995,12/3/1994,15,0,SEA,Thunder,...,55.135208,MIL,Bucks,108,1357.6219,1356.7598,H,W,0.899416,
94323,47162,200302180ORL,NBA,1,2003,2/18/2003,55,0,NOH,Pelicans,...,41.482147,ORL,Magic,99,1455.0071,1460.9988,A,L,0.439981,
37000,18501,197701230WSB,NBA,0,1977,1/23/1977,43,0,WSB,Wizards,...,45.782440,DET,Pistons,108,1518.8035,1512.1829,H,W,0.669820,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119879,59940,201302060OKC,NBA,0,2013,2/6/2013,49,0,OKC,Thunder,...,60.441074,GSW,Warriors,98,1551.1230,1545.6824,H,W,0.807499,
103694,51848,200612070NJN,NBA,1,2007,12/7/2006,17,0,PHO,Suns,...,52.250179,NJN,Nets,157,1479.3242,1473.5780,A,W,0.536483,
860,431,194801060BLB,NBA,1,1948,1/6/1948,21,0,PRO,Steamrollers,...,24.548496,BLB,Baltimore,82,1433.5667,1437.2885,A,L,0.151443,
15795,7898,196712130DET,NBA,1,1968,12/13/1967,31,0,NYK,Knicks,...,35.066292,DET,Pistons,129,1430.0306,1436.9717,A,L,0.327986,


In [None]:
ds3

Split in two dataframes

In [8]:

list_of_dataframes = np.array_split(ds3, 2)
print("First:")
list_of_dataframes[0] #not shown: a cell displays only its last expression
print('-'*100)
print("Second:")
list_of_dataframes[1]

First:
----------------------------------------------------------------------------------------------------
Second:


Unnamed: 0,gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,team_id,fran_id,...,win_equiv,opp_id,opp_fran,opp_pts,opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes
63157,31579,199002060SAS,NBA,0,1990,2/6/1990,45,0,SAS,Spurs,...,46.433861,ATL,Hawks,94,1477.7819,1471.8431,H,W,0.698796,
63158,31580,199002070BOS,NBA,1,1990,2/7/1990,44,0,CHH,Pelicans,...,18.800606,BOS,Celtics,146,1526.7374,1529.3101,A,L,0.099172,
63159,31580,199002070BOS,NBA,0,1990,2/7/1990,46,0,BOS,Celtics,...,46.371326,CHH,Pelicans,125,1243.4368,1240.8641,H,W,0.900828,
63160,31581,199002070LAL,NBA,1,1990,2/7/1990,46,0,CHI,Bulls,...,49.018215,LAL,Lakers,121,1673.3429,1679.2725,A,L,0.227701,
63161,31581,199002070LAL,NBA,0,1990,2/7/1990,46,0,LAL,Lakers,...,60.440876,CHI,Bulls,103,1561.1747,1555.2451,H,W,0.772299,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
126309,63155,201506110CLE,NBA,0,2015,6/11/2015,100,1,CLE,Cavaliers,...,60.309792,GSW,Warriors,103,1790.9591,1809.9791,H,L,0.546572,
126310,63156,201506140GSW,NBA,0,2015,6/14/2015,102,1,GSW,Warriors,...,68.013329,CLE,Cavaliers,91,1704.3949,1700.7391,H,W,0.765565,
126311,63156,201506140GSW,NBA,1,2015,6/14/2015,101,1,CLE,Cavaliers,...,60.010067,GSW,Warriors,104,1809.9791,1813.6349,A,L,0.234435,
126312,63157,201506170CLE,NBA,0,2015,6/16/2015,102,1,CLE,Cavaliers,...,59.290245,GSW,Warriors,105,1813.6349,1822.2881,H,L,0.481450,


In [9]:
# splitting dataframe by row index
# first 1000 rows
df1 = nba.iloc[:1000,:]
# remaining rows (from row 1000 onwards)
df2 = nba.iloc[1000:,:]
print(df1)
print("---------------------------")
print(df2)

     gameorder       game_id lg_id  _iscopy  year_id  date_game  seasongame  \
0            1  194611010TRH   NBA        0     1947  11/1/1946           1   
1            1  194611010TRH   NBA        1     1947  11/1/1946           1   
2            2  194611020CHS   NBA        0     1947  11/2/1946           1   
3            2  194611020CHS   NBA        1     1947  11/2/1946           2   
4            3  194611020DTF   NBA        0     1947  11/2/1946           1   
..         ...           ...   ...      ...      ...        ...         ...   
995        498  194802190GSW   NBA        1     1948  2/19/1948          36   
996        499  194802190STB   NBA        1     1948  2/19/1948          39   
997        499  194802190STB   NBA        0     1948  2/19/1948          36   
998        500  194802210PRO   NBA        1     1948  2/21/1948          35   
999        500  194802210PRO   NBA        0     1948  2/21/1948          40   

     is_playoffs team_id       fran_id  ...  win_eq

In [None]:
df1

*Splitting by Columns (Features)*

In [10]:
# Split the DataFrame using iloc[] by columns
# first 3 columns
df1 = nba.iloc[:,:3]
# remaining columns (from the 4th onwards)
df2 = nba.iloc[:,3:]
print(df1)
print("---------------------------")
print(df2)

        gameorder       game_id lg_id
0               1  194611010TRH   NBA
1               1  194611010TRH   NBA
2               2  194611020CHS   NBA
3               2  194611020CHS   NBA
4               3  194611020DTF   NBA
...           ...           ...   ...
126309      63155  201506110CLE   NBA
126310      63156  201506140GSW   NBA
126311      63156  201506140GSW   NBA
126312      63157  201506170CLE   NBA
126313      63157  201506170CLE   NBA

[126314 rows x 3 columns]
---------------------------
        _iscopy  year_id  date_game  seasongame  is_playoffs team_id  \
0             0     1947  11/1/1946           1            0     TRH   
1             1     1947  11/1/1946           1            0     NYK   
2             0     1947  11/2/1946           1            0     CHS   
3             1     1947  11/2/1946           2            0     NYK   
4             0     1947  11/2/1946           1            0     DTF   
...         ...      ...        ...         ...          

### Generate a new Dataset
After the dataset analysis it could be necessary to generate a new dataset.

In [None]:
#import os
#print(os.getcwd())

#trailing slash needed because the file name is appended to the folder path
filePath='/content/gDrive/MyDrive/MIA/ColabNotebooks/Datasets/'
dfCopy.to_csv(filePath+'newDataSet.csv', sep=';', index=False)

### Exporting only a few features

In [None]:
#export only the selected columns
dfCopy.to_csv(filePath+'newDataSet2.csv',columns=['Age','Salary'])
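
To check an export without access to Google Drive, the CSV can be written to an in-memory buffer and read back. A minimal sketch, assuming a made-up frame (`people`) and `io.StringIO` in place of the Drive path used above:

```python
import io
import pandas as pd

# Hypothetical toy data with one missing value
people = pd.DataFrame({"Name": ["Alice", "Bob"],
                       "Age": [25, 31],
                       "Salary": [50000.0, None]})

buffer = io.StringIO()   # in-memory stand-in for a file on disk
people.to_csv(buffer, sep=";", index=False, columns=["Age", "Salary"])

buffer.seek(0)           # rewind before reading
restored = pd.read_csv(buffer, sep=";")
print(restored)
```

The same separator has to be passed to `read_csv`, since `;` is not its default.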

End!