# Master's in Applied Artificial Intelligence
## Machine Learning Algorithms Course

Notebooks for MLA course

by [*lufer*](mailto:lufer@ipca.pt)

---



# Datasets on ML Modelling - Part I

**Contents**:

1.   **Python Essential**
2.   **Working with Datasets**
3.   Features Manipulation
4.   Essential Data Analysis
5.   Data Visualization
6.   References




## Environment preparation


### Import necessary Libraries

In [None]:
import pandas as pd
import numpy as np

## 1 - Python Essential


### Mounting Google Drive

In [None]:
#see https://towardsdatascience.com/different-ways-to-connect-google-drive-to-a-google-colab-notebook-pt-1-de03433d2f7a
#see https://adriandolinay.medium.com/an-introduction-to-google-colab-2023-6c26792827b3

from google.colab import drive

# it will ask for your Google Drive credentials
drive.mount('/content/gDrive/', force_remount=True)

### Unmounting Google Drive

In [None]:
#see https://www.geeksforgeeks.org/unmount-drive-in-google-colab-and-remount-to-another-drive/
#!ls "/content/MyDrive/"
#from google.colab import drive
#drive.flush_and_unmount()

### Python Functions Definition

In [None]:
#Python function definition
def show(p): print(p)

#using the defined function
show("Let's start working with Datasets")

In [None]:
#Importing external Python scripts (modules)
%cd "/content/gDrive/MyDrive/MIA/Python/"
!pwd
#import
import functions
print(functions.addNumbers(2,3))
print(functions.getCurrentTime())

#Other example
#import time
#print("Sleeping")
#time.sleep(30) # sleep for a while; interrupt me!
#print("Done Sleeping")
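
The cell above assumes that a `functions.py` file exists in the current folder. Its contents are not shown in this notebook, so the following is only a guess at what such a module might look like. Both function bodies are assumptions made for illustration.

```python
# functions.py (hypothetical contents, not taken from the course material)
from datetime import datetime


def addNumbers(a, b):
    """Return the sum of two numbers."""
    return a + b


def getCurrentTime():
    """Return the current local time as an HH:MM:SS string."""
    return datetime.now().strftime("%H:%M:%S")


print(addNumbers(2, 3))
print(getCurrentTime())
```

Once a file like this is saved in the working directory, `import functions` makes both helpers available as `functions.addNumbers` and `functions.getCurrentTime`.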

### Python Classes and Objects

In [None]:
#Class definition
#lufer
class Person:
  name=""
  age=1
  #-----
  # assign values to object properties, or other operations that are necessary
  # to do when the object is being created
  def __init__(self, name, age):
    self.name = name
    self.age = age
  #-----
  #  controls what should be returned when the class object is represented as a string
  def __str__(self):
    return f"{self.name}({self.age})"
  #-----
  # other methods
  def myfunc(self):
    print("Hello my name is : " + self.name)

#Objects
p1 = Person("Ana Gustavo", 12)

print(p1)
print(p1.age)
print (f"Age: ({p1.age}) - Name: ({p1.name}) " )
p1.myfunc()

### HTML Code

In [None]:
%%html
<marquee style='width: 30%; color: red;'><b>Benfica is the greatest!</b></marquee>

### HTML/SVG

In [None]:
%%html
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 450 400" width="200" height="200">
  <rect x="80" y="60" width="250" height="250" rx="20" style="fill:red; stroke:black; fill-opacity:0.7" />
  <rect x="180" y="110" width="250" height="250" rx="40" style="fill:blue; stroke:black; fill-opacity:0.5;" />
</svg>

## 2 - Working with Datasets

### Prepare the environment

In [None]:
#check which Python version is installed
!python --version # Python 3.9.18

In [None]:
#mount my Google Drive
#from google.colab import drive
#drive.mount('/content/drive')

### Read a Dataset from a CSV file

In [None]:
#option 1
with open('/content/gDrive/MyDrive/MIA/ColabNotebooks/Datasets/test.csv', 'r') as f:
    print(f.read())

In [None]:
#option 2
dataset = pd.read_csv('/content/gDrive/MyDrive/MIA/ColabNotebooks/Datasets/test.csv')


**After reading, the dataset is a pandas DataFrame, a tabular (Excel-like) data structure.**

In [None]:
type(dataset)
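
When the file on Google Drive is not available, the same `pd.read_csv` call can be tried on an in-memory CSV. The sketch below uses a tiny made-up table; the column names mirror the ones used later in this notebook, but the values are invented for illustration.

```python
import io

import pandas as pd

# Made-up CSV content standing in for test.csv (values are illustrative only)
csv_text = (
    "battery_power,dual_sim,clock_speed\n"
    "842,0,2.2\n"
    "1021,1,0.5\n"
    "563,1,0.5\n"
)

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))
print(sample.shape)
```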

**Number of records in the dataframe**

In [None]:
#number of records in the dataframe
len(dataset)

**Dataframe Structure (rows x columns)**

In [None]:
#Structure (rows x columns) of the dataframe
dataset.shape

**DataFrame Axes (0: rows, 1: columns)**

In [None]:
dataset.axes

In [None]:
dataset.axes[1]

In [None]:
dataset.index

In [None]:
dataset.values
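
As a small sketch of how these attributes relate to each other, the example below builds a made-up DataFrame (the values are assumptions, not taken from the course dataset) and checks that `shape`, `axes`, `len` and `values` describe the same structure.

```python
import pandas as pd

# Small made-up DataFrame used only for illustration
sample = pd.DataFrame({"battery_power": [842, 1021, 563],
                       "dual_sim": [0, 1, 1]})

n_rows, n_cols = sample.shape      # (rows, columns)
row_axis, col_axis = sample.axes   # axis 0 = row index, axis 1 = column labels

print(n_rows, n_cols)
print(list(row_axis))
print(list(col_axis))
print(sample.values.shape)         # underlying NumPy array has the same shape
```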

**First 10 records**

In [None]:
#first 10 records (head() alone returns only the first 5)
dataset.head(10)

**Last 10 records**

In [None]:
#last 10 records (tail() alone returns only the last 5)
dataset.tail(10)
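
Both `head` and `tail` take an optional number of rows, which defaults to 5. A minimal sketch on a made-up 20-row DataFrame:

```python
import pandas as pd

# Made-up DataFrame with 20 rows, for illustration only
sample = pd.DataFrame({"x": range(20)})

default_head = sample.head()    # no argument: first 5 rows
first_ten = sample.head(10)     # first 10 rows
last_three = sample.tail(3)     # last 3 rows

print(len(default_head), len(first_ten), last_three["x"].tolist())
```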

**Details about each column of the dataset**

*Data type* and count of *Non-Null* values for each column

In [None]:
#Details about each column
dataset.info()
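
The figures printed by `info()` can also be obtained as regular pandas objects, which is handy when they are needed in later computations. The sketch below uses a made-up DataFrame with one missing value.

```python
import numpy as np
import pandas as pd

# Made-up DataFrame with one missing value, for illustration only
sample = pd.DataFrame({"battery_power": [842.0, np.nan, 563.0],
                       "dual_sim": [0, 1, 1]})

non_null = sample.notna().sum()   # non-null count per column
missing = sample.isna().sum()     # missing count per column

print(sample.dtypes)
print(non_null)
print(missing)
```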

In [None]:
pd.set_option("display.precision", 2)

### Basic Statistics over the Dataset

Basic statistics:

*count, mean, standard deviation (std), minimum, maximum, percentiles (25%, 50%, 75%)*

**DataFrame.describe(percentiles=None, include=None, exclude=None)**

Descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding *NaN* values.

In [None]:
#see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
dataset.describe()

In [None]:
dataset.describe(percentiles=None, include=None, exclude=None)

Considering the percentiles of the *battery_power* column:

*   (25%) - 25% of the records have the value 895 or below
*   (50%) - 50% of the records have the value 1246.5 or below
*   (75%) - 75% of the records have the value 1629.25 or below
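
The reading of percentiles given above can be checked directly with `Series.quantile`. The sketch below uses made-up values (the integers 1 to 100), not the course dataset, and also shows how `describe` can be asked for other percentiles.

```python
import pandas as pd

# Made-up values (1..100) standing in for a numeric column
values = pd.Series(range(1, 101), name="battery_power")

# Quartiles, using the default linear interpolation
quartiles = values.quantile([0.25, 0.5, 0.75])

# Share of records at or below the first quartile
share_at_or_below_q1 = (values <= quartiles.loc[0.25]).mean()

# describe() with custom percentiles instead of the default 25%/50%/75%
custom = values.describe(percentiles=[0.1, 0.9])

print(quartiles)
print(share_at_or_below_q1)
print(custom)
```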

**Statistics for a particular column**

*Count number of non-NA/null observations.*

In [None]:
dataset['battery_power'].count()

*Maximum of the battery_power values*

In [None]:
dataset['battery_power'].max()

*Mean of the battery_power values*

In [None]:
dataset['battery_power'].mean()

*Minimum value of battery_power*

In [None]:
dataset['battery_power'].min()

*Median of battery_power column*

In [None]:
dataset['battery_power'].median()
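
Several of these statistics can be requested in a single call with `Series.agg`. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up values, for illustration only
s = pd.Series([842, 1021, 563, 1800], name="battery_power")

# One call returning several statistics, indexed by function name
summary = s.agg(["count", "min", "max", "mean", "median"])

print(summary)
```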

*Check how many occurrences there are of each value in a particular column*

In [None]:
dataset['battery_power'].value_counts()

*Select part of the columns*

In [None]:
dataset[['battery_power','dual_sim']]

*Details about part of the columns*

In [None]:
dataset[['battery_power','dual_sim']].describe(include="all")

*Details about a particular column*

In [None]:
dataset.battery_power.describe()

### Considering columns of type *object* (text)

By default, only numeric columns are considered. To analyse object columns (text values), use "*dataset.describe(include=object)*"

*Example of Dataframe with objects:*

In [None]:
df = pd.DataFrame({'gender': pd.Categorical(['Male', 'Female', 'Others']),
                   'quantity': [12, 25, 3],
                   'symbol': ['M', 'F', 'O']
                   })
df

*Describing all columns of a DataFrame regardless of data type*

In [None]:
df.describe(include="all")

*Including only string columns in the DataFrame description.*

In [None]:
df.describe(include=object)

*Excluding numeric columns from a DataFrame description.*

In [None]:
df.describe(exclude=[np.number])

*Including only numeric columns in the DataFrame description.*

In [None]:
df.describe(include=[np.number])
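
The `include` and `exclude` arguments select columns by data type. A related way to see which columns fall on each side is `DataFrame.select_dtypes`, sketched below on the same example DataFrame.

```python
import numpy as np
import pandas as pd

# Same example DataFrame as above
df = pd.DataFrame({'gender': pd.Categorical(['Male', 'Female', 'Others']),
                   'quantity': [12, 25, 3],
                   'symbol': ['M', 'F', 'O']
                   })

# Column names split by data type
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

# Columns summarised when only numeric types are included
described_cols = df.describe(include=[np.number]).columns.tolist()

print(numeric_cols, other_cols, described_cols)
```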

### Filtering

*Get filtered data*

Column value greater than...

In [None]:
#Work on a filtered copy of the original dataset
newdf = dataset.query('battery_power>1000').copy()
newdf
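
A `query` string is one way to filter rows; a boolean mask is another and gives the same result. The sketch below compares both on a made-up DataFrame.

```python
import pandas as pd

# Made-up DataFrame, for illustration only
sample = pd.DataFrame({"battery_power": [842, 1021, 563, 1800],
                       "dual_sim": [0, 1, 1, 0]})

by_query = sample.query("battery_power > 1000")
by_mask = sample[sample["battery_power"] > 1000]

print(by_query)
print(by_query.equals(by_mask))
```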

Filtering on one column and counting the values of another

In [None]:
newdf.loc[newdf["battery_power"] >1500, "clock_speed"].value_counts()

Complex filtering

In [None]:
newdf[(newdf["clock_speed"] < 1) & (newdf["dual_sim"] == 1)].groupby(
    ["battery_power", "mobile_wt", "clock_speed"])["int_memory"].count()
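
When conditions are combined, each one needs its own parentheses, because `&` binds more tightly than the comparison operators. The sketch below builds the mask step by step on a made-up DataFrame before grouping the filtered rows.

```python
import pandas as pd

# Made-up DataFrame, for illustration only
sample = pd.DataFrame({
    "clock_speed": [0.5, 0.5, 2.2, 0.8],
    "dual_sim": [1, 1, 1, 0],
    "int_memory": [16, 32, 64, 8],
})

# Each condition is parenthesised, then combined element-wise with &
mask = (sample["clock_speed"] < 1) & (sample["dual_sim"] == 1)

# Count the filtered rows per clock_speed value
counts = sample[mask].groupby("clock_speed")["int_memory"].count()

print(mask.tolist())
print(counts)
```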

*Aggregations over grouped data*

In [None]:
dataset.groupby('dual_sim').sum()

In [None]:
dataset.groupby('dual_sim').count()

In [None]:
dataset.groupby("dual_sim", sort=False)["n_cores"].sum()
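
Several aggregations can be computed per group in one call with `agg`. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up DataFrame, for illustration only
sample = pd.DataFrame({"dual_sim": [0, 1, 1, 0],
                       "n_cores": [2, 4, 8, 6]})

# One row per dual_sim value, one column per aggregation
grouped = sample.groupby("dual_sim")["n_cores"].agg(["sum", "count", "mean"])

print(grouped)
```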

### References


*   Using pandas and Python to Explore Your Dataset
https://realpython.com/pandas-python-explore-dataset/




End!