Prepared By

*Asif Newaz, Lecturer, EEE, IUT*



In the previous lab, you received a basic introduction to the Python programming language, including Python data structures.

In this lab, you will explore some essential libraries that are crucial for getting started with Data Science.

# Numpy

NumPy (Numerical Python) is a widely used library for numerical computations in Python. Its arrays are much more efficient than Python lists for numerical work and support many more operations.

It provides support for large arrays and matrices, along with mathematical functions to operate on these arrays. It works in a way similar to what you have learnt in MATLAB.

To find the NumPy equivalents of MATLAB functions, see this documentation: https://numpy.org/doc/stable/user/numpy-for-matlab-users.html

In [None]:
import numpy as np

## **Creating Numpy Array**

In [None]:
# Creating a 1-dimensional array
a = np.array([1, 2, 4, 5])
print(a)

[1 2 4 5]


In [None]:
b= np.array([[2,6,7],[4,2,1]])
b

array([[2, 6, 7],
       [4, 2, 1]])

In [None]:
c= np.array([[2,6,7],[4,2]])
c

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

A NumPy array works like a MATLAB matrix: every row must have the same length, unlike nested Python lists.

In [None]:
cl= [[2,6,7],[4,2]]
cl
# this is a list of lists - different from a multidimensional array

[[2, 6, 7], [4, 2]]

In [None]:
# There are other ways to create a NumPy array. Some of them are familiar to you.
d= np.zeros(3)
print(d)

dd= np.zeros((2,3))
print(dd)

[0. 0. 0.]
[[0. 0. 0.]
 [0. 0. 0.]]
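A few more array constructors are sketched here for comparison with their MATLAB counterparts. The variable names are illustrative only and do not come from the lab.

```python
import numpy as np

ones = np.ones((2, 2))         # 2x2 array filled with 1.0
steps = np.arange(0, 10, 2)    # start, stop (exclusive), step
grid = np.linspace(0, 1, 5)    # 5 evenly spaced points, both endpoints included
ident = np.eye(3)              # 3x3 identity matrix

print(ones)
print(steps)   # [0 2 4 6 8]
print(grid)
print(ident)
```

Note that the shape is passed as a single tuple, as in np.zeros((2,3)), rather than as separate arguments.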


## **Methods**

NumPy arrays support a wide range of methods that are useful for different purposes.

In [None]:
dir(a)

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__class_getitem__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__dlpack__',
 '__dlpack_device__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',

In [None]:
argmax(a)

NameError: name 'argmax' is not defined

In [None]:
np.argmax(a)
#  It returns the index of the maximum value in a given array or list

3

In [None]:
np.ravel(b)
# It flattens a multi-dimensional array

array([2, 6, 7, 4, 2, 1])
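Many of these operations exist both as a function in the np namespace and as a method on the array itself. The sketch below reuses the arrays a and b from above and also contrasts ravel with flatten, a related method not used elsewhere in this lab.

```python
import numpy as np

a = np.array([1, 2, 4, 5])
b = np.array([[2, 6, 7], [4, 2, 1]])

idx_func = np.argmax(a)   # function form
idx_meth = a.argmax()     # method form, same result
print(idx_func, idx_meth)

flat_view = b.ravel()     # returns a view of b when possible
flat_copy = b.flatten()   # always returns an independent copy

flat_copy[0] = 99         # changing the copy leaves b untouched
print(b[0, 0])            # 2
print(np.shares_memory(flat_view, b))
```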

## **Difference with Lists**

In [None]:
e = np.array([1, 4, 3])
f = np.array([9, 5, 6])

# Element-wise addition
g= e+f
print(g)

# Element-wise multiplication
h= e*f
print(h)

# Dot product
i= np.dot(e,f)
print(i)

#Dot Product
ii =  e @ f
print(ii)

[10  9  9]
[ 9 20 18]
47
47


In [None]:
el = [1, 4, 3]
fl = [9, 5, 6]
print(el+fl)
print(el*fl)

[1, 4, 3, 9, 5, 6]


TypeError: can't multiply sequence by non-int of type 'list'

As you can see, you cannot do element-wise operations using lists: adding two lists concatenates them, and multiplying two lists raises a TypeError. That also makes lists unsuitable for efficient operations such as broadcasting and vectorization.
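To make broadcasting and vectorization concrete, here is a minimal sketch. The array m and the list comprehension are illustrative additions, not part of the original lab.

```python
import numpy as np

e = np.array([1, 4, 3])
m = np.array([[1, 2, 3], [4, 5, 6]])

scaled = e * 2     # the scalar is broadcast to every element
shifted = m + e    # shape (2, 3) + shape (3,): e is added to each row of m
print(scaled)      # [2 8 6]
print(shifted)

# With plain lists, * means repetition and element-wise work needs a loop
el = [1, 4, 3]
fl = [9, 5, 6]
repeated = el * 2
summed = [x + y for x, y in zip(el, fl)]
print(repeated)    # [1, 4, 3, 1, 4, 3]
print(summed)      # [10, 9, 9]
```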

## **Indexing**

In [None]:
#Indexing
x= np.array([2,3, 100, -20])
print(x)
print(x[1])

[  2   3 100 -20]
3


In [None]:
x[-1]

-20

In [None]:
x2= np.array([[2,3],[4,5],[6,7]])
print(x2)
print(x2[1])
print(x2[2,0])

[[2 3]
 [4 5]
 [6 7]]
[4 5]
6


In [None]:
x2[:,0]

array([2, 4, 6])

In [None]:
x2[x2>5]
# numpy array also supports boolean indexing

array([6, 7])

In [None]:
x2[:1,:]

array([[2, 3]])

In [None]:
print(x2[:,0])
print(x2[:,0:1])

[2 4 6]
[[2]
 [4]
 [6]]
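The last cell prints the same column in two different shapes. An integer index drops that dimension, while a slice keeps it. The sketch below checks this on x2 and also shows what the boolean mask behind x2[x2>5] looks like. The names mask and clipped are illustrative.

```python
import numpy as np

x2 = np.array([[2, 3], [4, 5], [6, 7]])

print(x2[:, 0].shape)     # (3,)   integer index: result is 1-D
print(x2[:, 0:1].shape)   # (3, 1) slice: result stays 2-D

mask = x2 > 5             # boolean array with the same shape as x2
print(mask)

every_other = x2[::2]     # rows 0 and 2 (step of 2)
clipped = x2.copy()
clipped[clipped > 5] = 5  # a mask can also be used for assignment
print(every_other)
print(clipped)
```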


## **Some common functions**

In [None]:
a = np.array([2, 5, 8])
print(a.ndim)
b= np.array([[1,2,3],[4,5,6]])
print(b.ndim)
print(b.shape)

1
2
(2, 3)


In [None]:
print(a.sum())
print(a.max())
print(a.min())
print(a.mean())
print(b%2)

15
8
2
5.0
[[1 0 1]
 [0 1 0]]


In [None]:
print(b)
b.reshape(3,2)

[[1 2 3]
 [4 5 6]]


array([[1, 2],
       [3, 4],
       [5, 6]])
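Two details are easy to miss here: reshape returns a new array and leaves b unchanged, and reductions such as sum accept an axis argument. A small sketch, with illustrative variable names:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])

r = b.reshape(3, -1)       # -1 lets NumPy infer the remaining dimension
print(r.shape)             # (3, 2)
print(b.shape)             # (2, 3): b itself is not modified

col_sums = b.sum(axis=0)   # collapse the rows: one sum per column
row_sums = b.sum(axis=1)   # collapse the columns: one sum per row
print(col_sums)            # [5 7 9]
print(row_sums)            # [ 6 15]
```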

# Pandas

Pandas is a popular data manipulation library in Python that makes it easy to work with structured data like spreadsheets and databases. What we can do with pandas:


*   Data Loading: Easily load data from various file formats like CSV, Excel, JSON, etc.
*   Data Cleaning: Handle missing data, duplicates, and inconsistencies.
*   Data Analysis: Perform statistical analysis and apply aggregation functions.
*   Data Manipulation: Filter, sort, group, and reshape data.
*   Data Visualization: Create simple plots and integrate with libraries like Matplotlib.



To work with tabular data, you will need the Pandas library. It is fundamental to data science, and you will need it almost any time you work with data.

Pandas is built on top of NumPy arrays.

In [2]:
import pandas as pd

## **Core data structures**



*   Series: A one-dimensional array-like object, similar to a list or a NumPy array, but with additional functionality like indexing by labels.
*   DataFrame: A two-dimensional table (like a spreadsheet or SQL table) with rows and columns, where each column can be of a different type (e.g., integer, float, string). Usually columns hold different features such as age, gender. Rows hold different samples.

**Pandas Series object**

If you do not provide an index, one is assigned automatically (0, 1, 2, ...).

In [None]:
# you can create a series object from a list
st=['harry', 'ron', 'hermione', 100, 0.2, np.nan]
print(st)
ss= pd.Series(st)
ss

['harry', 'ron', 'hermione', 100, 0.2, nan]


Unnamed: 0,0
0,harry
1,ron
2,hermione
3,100
4,0.2
5,


In [None]:
# you can also create a series object from a dictionary
Dict = {'Name': 'Legion', 'category': 'cbm', 'character': 'david haller','identity': 'mutant'}
print(Dict)
ds= pd.Series(Dict)
ds

{'Name': 'Legion', 'category': 'cbm', 'character': 'david haller', 'identity': 'mutant'}


Unnamed: 0,0
Name,Legion
category,cbm
character,david haller
identity,mutant
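When a Series is built from a dictionary, the keys become the index labels. The sketch below shows label-based and position-based access on a shortened version of ds, plus a Series with an explicit index. The Series named custom is a made-up example.

```python
import pandas as pd

ds = pd.Series({'Name': 'Legion', 'category': 'cbm'})

by_label = ds['Name']        # look up by index label
by_position = ds.iloc[0]     # look up by integer position
labels = list(ds.index)
print(by_label, by_position, labels)

# An explicit index can be supplied when building from a list
custom = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(custom['b'])           # 20

# The underlying data is available as a NumPy array
values = custom.to_numpy()
print(type(values).__name__, values)
```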


In [None]:
dir(ds)

['Name',
 'T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__bool__',
 '__class__',
 '__column_consortium_standard__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pandas_priority__',
 '__pos__',
 '__pow__',
 '__radd__',
 

You can see that this pandas object offers a great deal of functionality.

**Pandas DataFrame**

A DataFrame contains multiple columns representing different variables/features/attributes.

In [3]:
# Create a dictionary with some data
data = {
    'Name': ['Bruce', 'Clark', 'Peter', 'Gandalf'],
    'Identity': ['Batman', 'Superman', 'Spiderman', 'Wizard'],
    'Actors': ['Affleck', 'Henry', 'Maguire', 'Ian'],
    'Age': [44, 27, 22, 50000],
    'Score': [10, 9.5, 9, 9.5]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

      Name   Identity   Actors    Age  Score
0    Bruce     Batman  Affleck     44   10.0
1    Clark   Superman    Henry     27    9.5
2    Peter  Spiderman  Maguire     22    9.0
3  Gandalf     Wizard      Ian  50000    9.5


In [None]:
dir(df)

['Actors',
 'Age',
 'Identity',
 'Name',
 'Score',
 'T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__dataframe__',
 '__dataframe_consortium_standard__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pandas_priority__',
 

## **Importing files**

You can import different files, such as spreadsheets, into your workspace as a DataFrame.

In [None]:
data=pd.read_csv('data_file_lab2.csv')
data

Unnamed: 0,dataset,IR,Unsampled,SMOTE,ADASYN,RUS,NC,CNN
0,wisconsin,1.86,96.41,96.39,96.9,97.36,97.89,96.68
1,yeast1,2.46,82.37,88.32,90.38,92.57,89.77,88.54
2,vehicle1,2.9,64.99,74.41,73.35,78.0,76.98,74.65
3,ecoli2,5.46,82.54,86.37,87.01,85.37,87.68,88.37
4,yeast3,8.1,82.26,89.65,90.34,91.2,89.21,89.64
5,ecoli3,8.6,70.45,82.63,77.15,86.26,74.33,76.81
6,page-blocks0,8.79,89.96,92.38,92.33,93.2,92.12,91.58
7,vowel0,9.98,75.36,87.44,81.74,93.02,79.49,94.86
8,glass2,11.59,7.07,26.56,36.56,57.4,13.77,37.3
9,glass4,15.47,71.21,83.64,83.13,86.86,70.7,82.06
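If the CSV file is not at hand, read_csv can be tried on in-memory text. This sketch uses the first two rows and three of the columns from the table above, wrapped in io.StringIO so that no file is required. The parameters usecols and index_col are standard read_csv options, and the variable names are illustrative.

```python
import io
import pandas as pd

csv_text = """dataset,IR,SMOTE
wisconsin,1.86,96.39
yeast1,2.46,88.32
"""

# read_csv accepts a path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)    # (2, 3)
print(sample.dtypes)

# Load only some columns and use one of them as the row index
by_name = pd.read_csv(io.StringIO(csv_text),
                      usecols=['dataset', 'IR'],
                      index_col='dataset')
print(by_name.loc['yeast1', 'IR'])   # 2.46
```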


Colab also offers some intuitive plots. You can also convert the dataframe into an interactive table.

## **Indexing**

Indexing allows you to select specific rows, columns, or elements of a DataFrame. There are several ways to index data in a pandas DataFrame, such as using labels or numerical positions.

**1. Indexing by Column Name**

You can directly access a column by its name.

In [None]:
df

Unnamed: 0,Name,Identity,Actors,Age,Score
0,Bruce,Batman,Affleck,44,10.0
1,Clark,Superman,Henry,27,9.5
2,Peter,Spiderman,Maguire,22,9.0
3,Gandalf,Wizard,Ian,50000,9.5


In [None]:
df['Actors']

Unnamed: 0,Actors
0,Affleck
1,Henry
2,Maguire
3,Ian


In [None]:
# extract multiple columns
df[['Actors', 'Age', 'Score' ]]

Unnamed: 0,Actors,Age,Score
0,Affleck,44,10.0
1,Henry,27,9.5
2,Maguire,22,9.0
3,Ian,50000,9.5


However, you cannot use this approach to extract rows.

It's just like a dictionary: you can look up values by key, but not keys by value.

In [None]:
df['Bruce']

KeyError: 'Bruce'
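To get the row for 'Bruce', one option is to compare the column against the value. Another is to make the column the index and then look up by label. A minimal sketch on a reduced version of df follows. The name by_name is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bruce', 'Clark', 'Peter', 'Gandalf'],
    'Identity': ['Batman', 'Superman', 'Spiderman', 'Wizard'],
    'Age': [44, 27, 22, 50000],
})

# Option 1: keep the rows where the Name column equals 'Bruce'
bruce_rows = df[df['Name'] == 'Bruce']
print(bruce_rows)

# Option 2: use Name as the row index, then select by label
by_name = df.set_index('Name')
bruce = by_name.loc['Bruce']
print(bruce['Identity'])   # Batman
```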

**2. Indexing by Row using loc and iloc**

loc: Access rows and columns by label.

iloc: Access rows and columns by integer position.

In [None]:
df.loc[1]

Unnamed: 0,1
Name,Clark
Identity,Superman
Actors,Henry
Age,27
Score,9.5


In [None]:
df.loc[1,3]
# careful when extracting multiple rows or columns: loc[1, 3] means row label 1 and column label 3

KeyError: 3
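The KeyError arises because loc reads two comma-separated arguments as a row label and a column label, and df has no column labelled 3. The sketch below shows ways to reach a single cell or several rows of one column.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bruce', 'Clark', 'Peter', 'Gandalf'],
    'Identity': ['Batman', 'Superman', 'Spiderman', 'Wizard'],
    'Actors': ['Affleck', 'Henry', 'Maguire', 'Ian'],
    'Age': [44, 27, 22, 50000],
    'Score': [10, 9.5, 9, 9.5],
})

age_by_label = df.loc[1, 'Age']    # row label 1, column label 'Age'
age_by_pos = df.iloc[1, 3]         # row position 1, column position 3 (Age)
print(age_by_label, age_by_pos)    # 27 27

two_rows = df.loc[[1, 3], 'Age']   # a list selects several row labels
print(two_rows)
```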

In [None]:
df.loc[[1,3]]

Unnamed: 0,Name,Identity,Actors,Age,Score
1,Clark,Superman,Henry,27,9.5
3,Gandalf,Wizard,Ian,50000,9.5


In [None]:
df.loc[1:3,['Name','Actors']]

Unnamed: 0,Name,Actors
1,Clark,Henry
2,Peter,Maguire
3,Gandalf,Ian


In [None]:
df.iloc[3]

Unnamed: 0,3
Name,Gandalf
Identity,Wizard
Actors,Ian
Age,50000
Score,9.5


**Difference between loc and iloc**

In [None]:
# see the difference between loc and iloc

print(df.loc[1:3])
print(" ")
print(df.iloc[1:3])

      Name   Identity   Actors    Age  Score
1    Clark   Superman    Henry     27    9.5
2    Peter  Spiderman  Maguire     22    9.0
3  Gandalf     Wizard      Ian  50000    9.5
 
    Name   Identity   Actors  Age  Score
1  Clark   Superman    Henry   27    9.5
2  Peter  Spiderman  Maguire   22    9.0
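With the default index, labels and positions happen to coincide, which hides part of the difference. The sketch below uses a made-up index of 10, 20, 30, 40 (an assumption for illustration) to separate the two: loc slices by label and includes the end label, while iloc slices by position and excludes the end.

```python
import pandas as pd

ages = pd.DataFrame({'Age': [44, 27, 22, 50000]}, index=[10, 20, 30, 40])

by_label = ages.loc[20:40]      # labels 20, 30 and 40 (end included)
by_position = ages.iloc[1:3]    # positions 1 and 2 (end excluded)
print(len(by_label), len(by_position))   # 3 2

# Label 1 does not exist in this index, so loc raises KeyError
try:
    ages.loc[1]
    found = True
except KeyError:
    found = False
print(found)                    # False
```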


In [None]:
df.iloc[1:3,['Name','Actors']]
# try to understand and remember the error messages. They will help you debug your code.

IndexError: .iloc requires numeric indexers, got ['Name' 'Actors']

In [None]:
df.iloc[1:3,[0,2]]

Unnamed: 0,Name,Actors
1,Clark,Henry
2,Peter,Maguire


**3. Boolean Indexing (Masking)**

In [None]:
df[df['Age'] > 25]

Unnamed: 0,Name,Identity,Actors,Age,Score
0,Bruce,Batman,Affleck,44,10.0
1,Clark,Superman,Henry,27,9.5
3,Gandalf,Wizard,Ian,50000,9.5


In [None]:
df[(df['Age'] > 25) & (df['Score'] > 9.5)]

Unnamed: 0,Name,Identity,Actors,Age,Score
0,Bruce,Batman,Affleck,44,10.0


In [None]:
df[df['Age'] > 25]['Actors']

Unnamed: 0,Actors
0,Affleck
1,Henry
3,Ian
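Conditions can also be combined with | (or) and negated with ~, and isin tests membership in a list. The sketch below additionally passes the mask to loc together with a column name, which does the row filter and the column selection in one step. Variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bruce', 'Clark', 'Peter', 'Gandalf'],
    'Actors': ['Affleck', 'Henry', 'Maguire', 'Ian'],
    'Age': [44, 27, 22, 50000],
    'Score': [10, 9.5, 9, 9.5],
})

# Each condition needs its own parentheses
mask = (df['Age'] > 25) & (df['Score'] >= 9.5)
actors = df.loc[mask, 'Actors']         # rows by mask, column by label
print(actors.tolist())                  # ['Affleck', 'Henry', 'Ian']

either = df[(df['Age'] < 25) | (df['Score'] == 10)]   # or
not_older = df[~(df['Age'] > 25)]                     # not
picked = df[df['Name'].isin(['Clark', 'Peter'])]      # membership test
print(either.index.tolist(), not_older.index.tolist(), picked.index.tolist())
```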


## **Adding rows and columns**

**Adding Column**

In [None]:
df

Unnamed: 0,Name,Identity,Actors,Age,Score
0,Bruce,Batman,Affleck,44,10.0
1,Clark,Superman,Henry,27,9.5
2,Peter,Spiderman,Maguire,22,9.0
3,Gandalf,Wizard,Ian,50000,9.5


In [None]:
df['Movie']=['BvS','Man of Steel','Spiderman','Lord of the Rings']
df

Unnamed: 0,Name,Identity,Actors,Age,Score,Movie
0,Bruce,Batman,Affleck,44,10.0,BvS
1,Clark,Superman,Henry,27,9.5,Man of Steel
2,Peter,Spiderman,Maguire,22,9.0,Spiderman
3,Gandalf,Wizard,Ian,50000,9.5,Lord of the Rings


In [None]:
df['Director']= ['Snyder', 'Snyder', 'Peter Jackson']
df

ValueError: Length of values (3) does not match length of index (4)

In [None]:
df['Director']= ['Snyder', 'Snyder', np.nan, 'Peter Jackson']
df

Unnamed: 0,Name,Identity,Actors,Age,Score,Movie,Director
0,Bruce,Batman,Affleck,44,10.0,BvS,Snyder
1,Clark,Superman,Henry,27,9.5,Man of Steel,Snyder
2,Peter,Spiderman,Maguire,22,9.0,Spiderman,
3,Gandalf,Wizard,Ian,50000,9.5,Lord of the Rings,Peter Jackson
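A new column can also be computed from existing ones, and a Series assigned as a column is aligned on the index rather than on length. The sketch below shows both on a reduced df. The column name Score_pct is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bruce', 'Clark', 'Peter', 'Gandalf'],
    'Score': [10, 9.5, 9, 9.5],
})

# Derived column: the arithmetic is applied element-wise
df['Score_pct'] = df['Score'] * 10
print(df['Score_pct'].tolist())   # [100.0, 95.0, 90.0, 95.0]

# A Series is matched by index label; the missing label 2 becomes NaN
df['Director'] = pd.Series({0: 'Snyder', 1: 'Snyder', 3: 'Peter Jackson'})
print(df)
```

This is an alternative to writing np.nan by hand when only some rows have a value.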


**Adding Rows**

In [None]:
new_row=['Bruce','Batman','Keaton', 60, 2, 'Batman Returns', np.nan]
# the values must follow the same order as the columns

In [None]:
df.append(new_row, ignore_index=True)
df
# DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. You will run into changes like this frequently while working with Python.

AttributeError: 'DataFrame' object has no attribute 'append'

You have to concatenate two DataFrames for this purpose.

In [None]:
row_to_add= {'Name': 'Bruce',
             'Identity': 'Batman',
             'Actors': 'Keaton',
             'Age': 60,
             'Score':2,
             'Movie': 'Flash',
             'Director': 'Shit shit shit'}
row= pd.DataFrame([row_to_add])
row

Unnamed: 0,Name,Identity,Actors,Age,Score,Movie,Director
0,Bruce,Batman,Keaton,60,2,Flash,Shit shit shit


In [None]:
df_new= pd.concat([df,row], ignore_index=True)
df_new

Unnamed: 0,Name,Identity,Actors,Age,Score,Movie,Director
0,Bruce,Batman,Affleck,44,10.0,BvS,Snyder
1,Clark,Superman,Henry,27,9.5,Man of Steel,Snyder
2,Peter,Spiderman,Maguire,22,9.0,Spiderman,
3,Gandalf,Wizard,Ian,50000,9.5,Lord of the Rings,Peter Jackson
4,Bruce,Batman,Keaton,60,2.0,Flash,Shit shit shit


You can also add multiple rows in the same way. Usually, we don't need this.
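As a minimal sketch of the multi-row case, two small DataFrames holding rows from the original table are stacked with concat. The names top and bottom are illustrative. The example also shows what ignore_index changes.

```python
import pandas as pd

top = pd.DataFrame({'Name': ['Bruce', 'Clark'], 'Age': [44, 27]})
bottom = pd.DataFrame({'Name': ['Peter', 'Gandalf'], 'Age': [22, 50000]})

kept = pd.concat([top, bottom])                           # original labels are kept
renumbered = pd.concat([top, bottom], ignore_index=True)  # fresh 0..n-1 index

print(kept.index.tolist())         # [0, 1, 0, 1]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Without ignore_index=True the result contains duplicate index labels, which makes later loc lookups ambiguous.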

## **Common Functionalities**

In [4]:
df

Unnamed: 0,Name,Identity,Actors,Age,Score
0,Bruce,Batman,Affleck,44,10.0
1,Clark,Superman,Henry,27,9.5
2,Peter,Spiderman,Maguire,22,9.0
3,Gandalf,Wizard,Ian,50000,9.5


In [6]:
# data types
df.dtypes

Unnamed: 0,0
Name,object
Identity,object
Actors,object
Age,int64
Score,float64


In [8]:
# viewing
# in case of large datasets, you may only want to view the first or last few rows
df.head(2)
# df.tail()

Unnamed: 0,Name,Identity,Actors,Age,Score
0,Bruce,Batman,Affleck,44,10.0
1,Clark,Superman,Henry,27,9.5


In [9]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [10]:
df.columns

Index(['Name', 'Identity', 'Actors', 'Age', 'Score'], dtype='object')

In [11]:
df.values

array([['Bruce', 'Batman', 'Affleck', 44, 10.0],
       ['Clark', 'Superman', 'Henry', 27, 9.5],
       ['Peter', 'Spiderman', 'Maguire', 22, 9.0],
       ['Gandalf', 'Wizard', 'Ian', 50000, 9.5]], dtype=object)

In [13]:
df.describe()

Unnamed: 0,Age,Score
count,4.0,4.0
mean,12523.25,9.5
std,24984.501774,0.408248
min,22.0,9.0
25%,25.75,9.375
50%,35.5,9.5
75%,12533.0,9.625
max,50000.0,10.0
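The summary above shows how a single extreme value (Gandalf's age) pulls the mean far from the median, which describe reports as the 50% row. A small check on the Age values:

```python
import pandas as pd

age = pd.Series([44, 27, 22, 50000], name='Age')

mean_age = age.mean()       # sensitive to the outlier
median_age = age.median()   # middle of the sorted values
print(mean_age)             # 12523.25
print(median_age)           # 35.5
```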


In [14]:
df.sort_values(by='Score')

Unnamed: 0,Name,Identity,Actors,Age,Score
2,Peter,Spiderman,Maguire,22,9.0
1,Clark,Superman,Henry,27,9.5
3,Gandalf,Wizard,Ian,50000,9.5
0,Bruce,Batman,Affleck,44,10.0
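sort_values returns a new DataFrame and leaves df in its original order. It also accepts several keys and a sort direction per key. A sketch on a reduced df, where the name ranked is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bruce', 'Clark', 'Peter', 'Gandalf'],
    'Age': [44, 27, 22, 50000],
    'Score': [10, 9.5, 9, 9.5],
})

# Highest score first; ties broken by the younger age
ranked = df.sort_values(by=['Score', 'Age'], ascending=[False, True])
print(ranked.index.tolist())   # [0, 1, 3, 2]
print(df.index.tolist())       # [0, 1, 2, 3]: df is unchanged

# Renumber the rows after sorting
ranked = ranked.reset_index(drop=True)
print(ranked)
```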


## **Saving as a CSV file**

In [16]:
df.to_csv('my_dataset.csv')
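By default to_csv also writes the row index as an unnamed first column, which shows up as Unnamed: 0 when the file is read back. The sketch below demonstrates this and the index=False option. It writes to an in-memory buffer instead of a file so that it runs anywhere, and the variable names are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Bruce', 'Clark'], 'Age': [44, 27]})

# Default: the index is written as an extra, unnamed column
with_index = df.to_csv()          # no path given, so a string is returned
reread = pd.read_csv(io.StringIO(with_index))
print(list(reread.columns))       # ['Unnamed: 0', 'Name', 'Age']

# index=False leaves the index out
buf = io.StringIO()
df.to_csv(buf, index=False)
back = pd.read_csv(io.StringIO(buf.getvalue()))
print(list(back.columns))         # ['Name', 'Age']
```

If an actual Excel workbook is needed, pandas also provides DataFrame.to_excel, which depends on an optional engine such as openpyxl being installed.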