# Task 12: Introduction to Pandas (Series, DataFrame basics)

Time to explore the high-level **data manipulation** tools offered by the Pandas library. Start by learning about Series and DataFrames and the basic operations on them.

In Pandas, tabular data is stored in an object called a **DataFrame**. A single labeled column of values is a **Series**.
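
As a rough illustration of how the two objects relate, the sketch below builds a tiny DataFrame and selects one column from it, which comes back as a Series. The names `toy` and `col` and all values are made up for this example.

```python
import pandas as pd

# hypothetical toy data, not part of the task
toy = pd.DataFrame({'name': ['x', 'y'], 'value': [1.5, 2.5]})

col = toy['value']                    # selecting one column gives a Series
print(type(toy).__name__)             # DataFrame
print(type(col).__name__)             # Series
print(col.index.equals(toy.index))    # the Series keeps the DataFrame's row index
```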

### Importing Libraries:

In [1]:
import pandas as pd
import numpy as np

### Create Pandas Series:

##### From Dictionary:

In [2]:
europe = {'Spain':'Madrid', 'France':'Paris', 'Germany':'Bonn', 'Norway':'Oslo'}

series_dict = pd.Series(europe)
print("Pandas Series from Dictionary:")
print(series_dict)

Pandas Series from Dictionary:
Spain      Madrid
France      Paris
Germany      Bonn
Norway       Oslo
dtype: object


##### Data Manipulation on Dictionary:

In [13]:
# add italy to europe
europe['Italy'] = 'Rome'
print("Europe Dict. after adding Italy:\n", europe)

# add poland to europe
europe['Poland'] = 'Warsaw'
print("\nEurope Dict. after adding Poland:\n", europe)

# Update capital of germany
europe['Germany'] = 'Berlin'
print("\nEurope Dict. after updating Capital of Germany:\n", europe)

Europe Dict. after adding Italy:
 {'Spain': 'Madrid', 'France': 'Paris', 'Germany': 'Bonn', 'Norway': 'Oslo', 'Italy': 'Rome'}

Europe Dict. after adding Poland:
 {'Spain': 'Madrid', 'France': 'Paris', 'Germany': 'Bonn', 'Norway': 'Oslo', 'Italy': 'Rome', 'Poland': 'Warsaw'}

Europe Dict. after updating Capital of Germany:
 {'Spain': 'Madrid', 'France': 'Paris', 'Germany': 'Berlin', 'Norway': 'Oslo', 'Italy': 'Rome', 'Poland': 'Warsaw'}
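
Note that `series_dict` was created before these updates, so it does not pick them up: `pd.Series` copies the dictionary's contents at creation time. A minimal sketch of this behaviour, using a hypothetical two-entry dictionary `capitals`:

```python
import pandas as pd

capitals = {'Spain': 'Madrid', 'Germany': 'Bonn'}   # hypothetical small dict
snapshot = pd.Series(capitals)       # Series built before the update

capitals['Germany'] = 'Berlin'       # change the dict afterwards
capitals['Italy'] = 'Rome'

print(snapshot['Germany'])           # still the value from creation time
refreshed = pd.Series(capitals)      # rebuild the Series to pick up the changes
print(refreshed['Germany'])
print('Italy' in snapshot.index, 'Italy' in refreshed.index)
```

If the Series should reflect the updated dictionary, it has to be rebuilt (or updated directly) after the changes.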


##### From List (with custom indexing):

In [3]:
areas = [11.25, 18.0, 20.0, 10.75, 9.50]

series_list = pd.Series(areas)

# assign a custom index to the Series
cust_ind = ['a', 'b', 'c', 'd', 'e']
series_list.index = cust_ind

print("Pandas Series from List:")
print(series_list)

Pandas Series from List:
a    11.25
b    18.00
c    20.00
d    10.75
e     9.50
dtype: float64
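
The custom index can also be supplied when the Series is created, instead of being assigned afterwards. A short sketch with the same values (the name `series_direct` is only for illustration):

```python
import pandas as pd

areas = [11.25, 18.0, 20.0, 10.75, 9.50]

# pass the labels through the index= argument of the constructor
series_direct = pd.Series(areas, index=['a', 'b', 'c', 'd', 'e'])
print(series_direct)
```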


##### From NumPy Array (with custom indexing):

In [4]:
house_length = np.array([18.0, 20.0, 10.75, 9.50])

series_nparr = pd.Series(house_length)

# assign a custom index to the Series
cus_ind = ['A', 'B', 'C', 'D']
series_nparr.index = cus_ind

print("Pandas Series from NumPy Array:")
print(series_nparr)

Pandas Series from NumPy Array:
A    18.00
B    20.00
C    10.75
D     9.50
dtype: float64


### Perform Basic Arithmetic Operation:

In [7]:
# add 10 to every element of the list-based Series
series_add = series_list + 10
print("List Series after adding 10:")
print(series_add)
print()


# divide every element of the previous result (series_add) by 10
series_div = series_add / 10
print("Add. List Series after dividing each element by 10:")
print(series_div)
print()


# multiply every element of the NumPy-array-based Series by 2
series_multi = series_nparr * 2
print("NumPy array Series after multiplying each element by 2:")
print(series_multi)
print()

# subtract 5 from every element of the NumPy-array-based Series
series_sub = series_nparr - 5
print("NumPy array Series after subtracting 5:")
print(series_sub)
print()

List Series after adding 10:
a    21.25
b    28.00
c    30.00
d    20.75
e    19.50
dtype: float64

Add. List Series after dividing each element by 10:
a    2.125
b    2.800
c    3.000
d    2.075
e    1.950
dtype: float64

NumPy array Series after multiplying each element by 2:
A    36.0
B    40.0
C    21.5
D    19.0
dtype: float64

NumPy array Series after subtracting 5:
A    13.00
B    15.00
C     5.75
D     4.50
dtype: float64
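
The operations above combine a Series with a single number. When two Series are combined, pandas matches elements by index label rather than by position, and labels present in only one of them produce NaN. A small sketch with two hypothetical Series `s1` and `s2`:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

total = s1 + s2                      # values are paired by label, not position
print(total)                         # 'a' and 'd' have no partner, so they are NaN

filled = s1.add(s2, fill_value=0)    # treat a missing partner as 0 instead
print(filled)
```

This is why adding `series_list` (labels 'a' to 'e') to `series_nparr` (labels 'A' to 'D') would yield only NaN values: none of the labels match.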



### Accessing Elements in Pandas Series:

##### Access Dictionary:

In [14]:
# Print out the keys in europe
print(europe.keys())

# Print value that belongs to key 'Norway'
print(europe["Norway"])

dict_keys(['Spain', 'France', 'Germany', 'Norway', 'Italy', 'Poland'])
Oslo


##### Access Elements in List-based Series:

In [15]:
# Access element by index label
print("Element accessed by index label 'c':", series_list['c'])

Element accessed by index label 'c': 20.0


In [16]:
# Access element by position 3 (zero-based index)
print("Element accessed by position 3:", series_list.iloc[3])

Element accessed by position 3: 10.75


We access a single element by its position (3) using the **'.iloc'** indexer. Positions are zero-based, so position 3 is the fourth element.
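
The label-based counterpart of '.iloc' is '.loc'. The sketch below puts the two side by side on a Series with the same values as `series_list` (rebuilt here as `s` so the block runs on its own), including slices, where the two behave differently.

```python
import pandas as pd

s = pd.Series([11.25, 18.0, 20.0, 10.75, 9.50],
              index=['a', 'b', 'c', 'd', 'e'])

print(s.loc['c'])        # label-based access
print(s.iloc[3])         # position-based access
print(s.loc['b':'d'])    # label slice: both endpoints are included
print(s.iloc[1:3])       # positional slice: the stop position is excluded
```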

##### Access Elements in NumPy-array-based Series:

In [5]:
# Access element by index label
print("Element accessed by index label 'B':", series_nparr['B'])

Element accessed by index label 'B': 20.0


In [6]:
# Access element by position 1 (zero-based index)
print("Element accessed by position 1:", series_nparr.iloc[1])

Element accessed by position 1: 20.0


### Create a DataFrame from dictionary of lists:

In [10]:
# dr: list of booleans telling whether people drive on the right (True) or on the left (False) in the corresponding country
# cpc: the number of motor vehicles per 1000 people

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# create dictionary my_dict with three key:value pairs
my_dict = {
    'country':names,
    'drives_right':dr,
    'cars_per_cap':cpc 
}
# build a DataFrame cars1 from my_dict
cars1 = pd.DataFrame(my_dict)

print(cars1)

         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45


##### Specify Row Labels:

In [11]:
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']

# specify row labels of cars
cars1.index = row_labels

print(cars1)

           country  drives_right  cars_per_cap
US   United States          True           809
AUS      Australia         False           731
JPN          Japan         False           588
IN           India         False            18
RU          Russia          True           200
MOR        Morocco          True            70
EG           Egypt          True            45
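
Row labels can also be passed directly to the constructor through `index=`. A shortened sketch using the first three countries only (the name `cars_labeled` is hypothetical):

```python
import pandas as pd

my_dict = {
    'country': ['United States', 'Australia', 'Japan'],
    'drives_right': [True, False, False],
    'cars_per_cap': [809, 731, 588],
}

# index= sets the row labels at creation time
cars_labeled = pd.DataFrame(my_dict, index=['US', 'AUS', 'JPN'])
print(cars_labeled)
```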


### Create a DataFrame from a NumPy array, specifying column and index names:

In [9]:
# create NumPy array
data = np.array([[75, 71, 86], [65, 50, 67], [70, 87, 90]])

# specify custom columns and index names
column_names = ['Physics', 'Math', 'Chemistry']
index_names = ['Student1', 'Student2', 'Student3']

# create the DataFrame
df = pd.DataFrame(data, columns=column_names, index=index_names)

print(df)

          Physics  Math  Chemistry
Student1       75    71         86
Student2       65    50         67
Student3       70    87         90


### Load a DataFrame from a CSV file:

In [13]:
cars = pd.read_csv('cars.csv', index_col = 0)

print(cars)

    Unnamed: 0  cars_per_cap        country  drives_right
NaN         US           809  United States          True
NaN        AUS           731      Australia         False
NaN        JAP           588          Japan         False
NaN         IN            18          India         False
NaN         RU           200         Russia          True
NaN        MOR            70        Morocco          True
NaN         EG            45          Egypt          True
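
To see what `index_col=0` does in isolation, the sketch below reads a small CSV from an in-memory string instead of a file. The CSV text and the column name `code` are assumptions made up for this example; the real cars.csv may be laid out differently.

```python
import io
import pandas as pd

# hypothetical CSV content; the first column holds the row labels
csv_text = """code,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JPN,588,Japan,False
"""

# index_col=0 tells read_csv to use the first column as the row index
cars_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(cars_demo)
print(cars_demo.index.tolist())
```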


##### Square Brackets:

In [14]:
# print country column as Pandas DataFrame
print(cars[['country']])

           country
NaN  United States
NaN      Australia
NaN          Japan
NaN          India
NaN         Russia
NaN        Morocco
NaN          Egypt


In [15]:
# print out DataFrame with country and drives_right columns
print(cars[['country', 'drives_right']])

           country  drives_right
NaN  United States          True
NaN      Australia         False
NaN          Japan         False
NaN          India         False
NaN         Russia          True
NaN        Morocco          True
NaN          Egypt          True
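
The number of brackets matters: single brackets with one column name return a Series, while double brackets (a list of names) return a DataFrame, even for a single column. A minimal sketch on a hypothetical two-row frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'country': ['Japan', 'India'],
                     'cars_per_cap': [588, 18]})

as_series = demo['country']      # single brackets -> Series
as_frame = demo[['country']]     # double brackets -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```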


##### Filtering DataFrame:

In [17]:
# extract drives_right column as a Pandas Series: dr
dr = cars["drives_right"]

# use dr as a boolean mask to subset cars: final_dr
final_dr = cars[dr]
print(final_dr)

    Unnamed: 0  cars_per_cap        country  drives_right
NaN         US           809  United States          True
NaN         RU           200         Russia          True
NaN        MOR            70        Morocco          True
NaN         EG            45          Egypt          True


**OR**

In [18]:
# the same filter written as a one-liner
f_dr = cars[cars["drives_right"]]

print(f_dr)

    Unnamed: 0  cars_per_cap        country  drives_right
NaN         US           809  United States          True
NaN         RU           200         Russia          True
NaN        MOR            70        Morocco          True
NaN         EG            45          Egypt          True


##### Filtering based on condition:

In [19]:
# create car_maniac: observations that have a cars_per_cap over 500
cpc = cars["cars_per_cap"]
many_cars = cpc > 500
car_maniac = cars[many_cars]

print(car_maniac)

    Unnamed: 0  cars_per_cap        country  drives_right
NaN         US           809  United States          True
NaN        AUS           731      Australia         False
NaN        JAP           588          Japan         False
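
Several conditions can be combined into one mask with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses. The sketch below uses a hypothetical four-row frame `demo` with the same column names as `cars`:

```python
import pandas as pd

demo = pd.DataFrame(
    {'country': ['United States', 'Australia', 'India', 'Russia'],
     'drives_right': [True, False, False, True],
     'cars_per_cap': [809, 731, 18, 200]},
    index=['US', 'AUS', 'IN', 'RU'])

# rows with cars_per_cap strictly between 100 and 800
medium = demo[(demo['cars_per_cap'] > 100) & (demo['cars_per_cap'] < 800)]

# rows that drive on the right OR have more than 500 cars per 1000 people
right_or_many = demo[demo['drives_right'] | (demo['cars_per_cap'] > 500)]

print(medium)
print(right_or_many)
```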


##### Adding column in DataFrame:

In [20]:
# add an upper-case copy of the country column using .apply(str.upper)
cars["COUNTRY"] = cars["country"].apply(str.upper)
print(cars)

    Unnamed: 0  cars_per_cap        country  drives_right        COUNTRY
NaN         US           809  United States          True  UNITED STATES
NaN        AUS           731      Australia         False      AUSTRALIA
NaN        JAP           588          Japan         False          JAPAN
NaN         IN            18          India         False          INDIA
NaN         RU           200         Russia          True         RUSSIA
NaN        MOR            70        Morocco          True        MOROCCO
NaN         EG            45          Egypt          True          EGYPT
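
An alternative to `.apply(str.upper)` is the vectorized `.str` accessor, which gives the same result on string columns and passes missing values through unchanged. A short comparison on a hypothetical frame `demo` (the column names `via_apply` and `via_str` are made up):

```python
import pandas as pd

demo = pd.DataFrame({'country': ['Japan', 'India', 'Egypt']})

demo['via_apply'] = demo['country'].apply(str.upper)   # element-wise function call
demo['via_str'] = demo['country'].str.upper()          # vectorized string method

print(demo)
print(demo['via_apply'].tolist() == demo['via_str'].tolist())
```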


##### Removing column from the DataFrame:

In [22]:
# remove a column
cars_new = cars.drop('country', axis=1)
print(cars_new)

    Unnamed: 0  cars_per_cap  drives_right        COUNTRY
NaN         US           809          True  UNITED STATES
NaN        AUS           731         False      AUSTRALIA
NaN        JAP           588         False          JAPAN
NaN         IN            18         False          INDIA
NaN         RU           200          True         RUSSIA
NaN        MOR            70          True        MOROCCO
NaN         EG            45          True          EGYPT


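To drop a column from an existing DataFrame directly, without assigning the result to a new variable, the general form is:

df.drop('ColumnToRemove', axis=1, inplace=True)

- **axis=1** specifies that we are dropping columns. For rows, use axis=0 (the default).
- **inplace=True** modifies the original DataFrame instead of returning a new one; in that case drop returns None.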
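
`drop` also accepts the keyword arguments `columns=` and `index=`, which avoid having to remember the axis number. A minimal sketch on a hypothetical frame `demo`, without `inplace`, so the original stays unchanged:

```python
import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

without_b = demo.drop(columns='b')    # same effect as demo.drop('b', axis=1)
without_row0 = demo.drop(index=0)     # same effect as demo.drop(0, axis=0)

print(list(demo.columns))             # original is untouched
print(list(without_b.columns))
print(without_row0.index.tolist())
```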

##### loc and iloc:

In [25]:
# print out the observation for Japan (row position 2)
print(cars_new.iloc[2])

# print out the observations for Australia and Egypt (row positions 1 and 6)
print(cars_new.iloc[[1,6]])

Unnamed: 0        JAP
cars_per_cap      588
drives_right    False
COUNTRY         JAPAN
Name: nan, dtype: object
    Unnamed: 0  cars_per_cap  drives_right    COUNTRY
NaN        AUS           731         False  AUSTRALIA
NaN         EG            45          True      EGYPT


- Use **'loc'** for label-based selection, which allows you to access data by row and column labels.
- Use **'iloc'** for position-based selection, which allows you to access data by row and column integer positions.
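
The cell above only uses 'iloc'. To compare both indexers, the sketch below selects the same rows and cells once by label and once by position. It uses a hypothetical three-row frame `demo` with country codes as row labels, not the `cars_new` frame from above.

```python
import pandas as pd

demo = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan'],
     'cars_per_cap': [809, 731, 588]},
    index=['US', 'AUS', 'JPN'])

print(demo.loc['JPN'])                      # one row by label -> Series
print(demo.iloc[2])                         # the same row by position

print(demo.loc[['US', 'JPN'], 'country'])   # rows and column by label
print(demo.iloc[[0, 2], 0])                 # the same cells by position
```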