#  1.3.1 Pandas Basics

Welcome to our introduction to `Pandas`. 

`Pandas` is an essential library in Python for data manipulation and analysis. 

It builds on `NumPy` to provide high-level data structures and tools. 

If you’re involved in data science, `Pandas` is crucial for cleaning, preparing, and exploring your data.

---
## 1.3.1.1 Pandas Series
---

Let’s start with the *Pandas Series*. 

A *Series* is a one-dimensional, array-like object in which every value is paired with an index label. It can hold integers, strings, floats, or other Python objects.

It’s quite similar to a *list* or a *single column* in a table. 

We’ll create a simple *Series* from a list of integers.


In [1]:
## Code snippet 1.3.1.1(a)

import pandas as pd

# Creating a Series
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

0    10
1    20
2    30
3    40
4    50
dtype: int64
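
The same constructor accepts other kinds of values as well. The short sketch below uses made-up values (they are not part of the original example) to show how the `dtype` of a *Series* follows from the data it is given. The exact name of the string dtype depends on the installed Pandas version, so no output is claimed for that line.

```python
import pandas as pd

# Illustrative values only (assumed for this sketch)
floats = pd.Series([1.5, 2.25, 3.0])
strings = pd.Series(["red", "green", "blue"])
mixed = pd.Series([10, "ten", 10.0])

print(floats.dtype)   # float64
print(strings.dtype)  # a string-like dtype; the name varies by Pandas version
print(mixed.dtype)    # object, because the values have no common numeric type
```

A *Series* with mixed values falls back to the generic `object` dtype, which is flexible but usually slower than a numeric dtype.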


In [5]:
## Code snippet 1.3.1.1(b)
## Determine the frequency for each color in the list
col2007 = ['red',
 'green',
 'blue',
 'blue',
 'yellow',
 'black',
 'green',
 'red',
 'red',
 'green',
 'blue',
 'yellow',
 'green',
 'blue',
 'yellow',
 'green',
 'blue',
 'blue',
 'red',
 'blue',
 'yellow',
 'blue',
 'blue',
 'yellow',
 'red',
 'yellow',
 'blue',
 'blue',
 'blue',
 'yellow',
 'blue',
 'green',
 'yellow',
 'green',
 'green',
 'blue',
 'yellow',
 'yellow',
 'blue',
 'yellow',
 'blue',
 'blue',
 'blue',
 'green',
 'green',
 'blue',
 'blue',
 'green',
 'blue',
 'green',
 'yellow',
 'blue',
 'blue',
 'yellow',
 'yellow',
 'red',
 'green',
 'green',
 'red',
 'red',
 'red',
 'red',
 'green',
 'red',
 'green',
 'yellow',
 'red',
 'red',
 'blue',
 'red',
 'red',
 'red',
 'red',
 'blue',
 'blue',
 'blue',
 'blue',
 'blue',
 'red',
 'blue',
 'blue',
 'blue',
 'yellow',
 'red',
 'green',
 'blue',
 'blue',
 'red',
 'blue',
 'red',
 'green',
 'black',
 'yellow',
 'blue',
 'blue',
 'green',
 'red',
 'red',
 'yellow',
 'yellow',
 'yellow',
 'red',
 'green',
 'green',
 'yellow',
 'blue',
 'green',
 'blue',
 'blue',
 'red',
 'blue',
 'green',
 'blue',
 'red',
 'green',
 'green',
 'blue',
 'blue',
 'green',
 'red',
 'blue',
 'blue',
 'green',
 'green',
 'red',
 'red',
 'blue',
 'red',
 'blue',
 'yellow',
 'blue',
 'green',
 'blue',
 'green',
 'yellow',
 'yellow',
 'yellow',
 'red',
 'red',
 'red',
 'blue',
 'blue']

col2007_series = pd.Series(col2007)
counts = col2007_series.value_counts()
print(counts)

blue      52
red       33
green     30
yellow    25
black      2
Name: count, dtype: int64


---
## 1.3.1.2  Pandas DataFrame
---

Next, we have the *Pandas DataFrame*. 

A *DataFrame* is a two-dimensional, table-like structure with labeled rows and columns, where each column can hold a different data type. 

It’s mutable in size, meaning you can add or remove data easily. 

Think of it as a table in a database or an Excel spreadsheet. Here’s how to create a DataFrame from a dictionary.


In [8]:
## Code snippet 1.3.1.2

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
}

# Creating a DataFrame from a dictionary
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
4      Eve   29      Phoenix
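
To make "mutable in size" concrete, here is a minimal sketch that adds a row and then removes one. The small `people` table and the `new_row` name are assumptions made for this illustration. Appending with `pd.concat` is one common approach, not the only one.

```python
import pandas as pd

# A small illustrative table (assumed for this sketch)
people = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [24, 27],
})

# Add a row by concatenating a one-row DataFrame
new_row = pd.DataFrame({"Name": ["Charlie"], "Age": [22]})
people = pd.concat([people, new_row], ignore_index=True)
print(people.shape)  # (3, 2)

# Remove the row with index label 0
people = people.drop(index=0)
print(people.shape)  # (2, 2)
```

Both `pd.concat` and `drop` return a new *DataFrame*, which is why the result is assigned back to `people`.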


---
## 1.3.1.3 DataFrame from Dictionary
---

Creating DataFrames from dictionaries is very straightforward in Pandas. 

Each *key* in the dictionary becomes a *column* in the DataFrame, and each *value* becomes the *data* for that column. 

Let’s look at an example with a simple dictionary.


In [9]:
## Code snippet 1.3.1.3

# Creating a DataFrame from a dictionary
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data)
print(df)

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


---
## 1.3.1.4  DataFrame from CSV
---

One of the most useful features of *Pandas* is its ability to read data from external files such as *CSV* files. 

The `read_csv()` function makes it easy to load data from a *CSV* file into a *DataFrame*. 

Let’s see how this works.

In [46]:
## Code snippet 1.3.1.4

import pandas as pd

# Reading a CSV file into a DataFrame
data2007 = pd.read_csv("datasets/gapminder.csv")

# Finding information about the DataFrame
data2007.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     142 non-null    object 
 1   year        142 non-null    int64  
 2   population  142 non-null    int64  
 3   cont        142 non-null    object 
 4   life_exp    142 non-null    float64
 5   gdp_cap     142 non-null    float64
dtypes: float64(2), int64(2), object(2)
memory usage: 6.8+ KB


In [47]:
print(data2007)

                country  year  population      cont  life_exp       gdp_cap
0           Afghanistan  2007    31889923      Asia    43.828    974.580338
1               Albania  2007     3600523    Europe    76.423   5937.029526
2               Algeria  2007    33333216    Africa    72.301   6223.367465
3                Angola  2007    12420476    Africa    42.731   4797.231267
4             Argentina  2007    40301927  Americas    75.320  12779.379640
..                  ...   ...         ...       ...       ...           ...
137             Vietnam  2007    85262356      Asia    74.249   2441.576404
138  West Bank and Gaza  2007     4018332      Asia    73.422   3025.349798
139         Yemen, Rep.  2007    22211743      Asia    62.698   2280.769906
140              Zambia  2007    11746035    Africa    42.384   1271.211593
141            Zimbabwe  2007    12311143    Africa    43.487    469.709298

[142 rows x 6 columns]
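
If the `datasets/gapminder.csv` file is not at hand, `read_csv()` can still be tried on CSV text held in memory. The sketch below is only an illustration: it copies three rows from the output above into a string and wraps it in `io.StringIO`, which `read_csv()` accepts in place of a file path. The names `csv_text` and `sample` are assumptions for this example.

```python
import io
import pandas as pd

# Three rows copied from the gapminder output above, written as CSV text
csv_text = """country,year,population,cont,life_exp,gdp_cap
Afghanistan,2007,31889923,Asia,43.828,974.580338
Albania,2007,3600523,Europe,76.423,5937.029526
Algeria,2007,33333216,Africa,72.301,6223.367465
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (3, 6)
print(sample.head(2))
```

The first line of the text becomes the column names, and the numeric columns are parsed as numbers automatically.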


---
## 1.3.1.5  Attributes of Pandas Series
---

*Pandas Series* come with several useful attributes, such as *index*, *values*, and *dtype*. 

These attributes help you understand the structure and type of data you’re working with. 

Let’s create a Series and explore these attributes.

In [27]:
## Code snippet 1.3.1.5

series = pd.Series([10, 20, 30, 40, 50])
print("Index:", series.index)
print("Values:", series.values)
print("Data type:", series.dtype)
print("Shape:", series.shape)
print("Size:", series.size)

Index: RangeIndex(start=0, stop=5, step=1)
Values: [10 20 30 40 50]
Data type: int64
Shape: (5,)
Size: 5
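
The `index` attribute does not have to be a range of integers. As a small sketch (the labels and the `scores` name are assumed for illustration), a *Series* can be given its own labels through the `index` argument, and those labels can then be used to look up values.

```python
import pandas as pd

# A Series with custom index labels (illustrative values)
scores = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(scores.index.tolist())  # ['a', 'b', 'c']
print(scores["b"])            # 20
print(scores.values)          # [10 20 30]
```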


---
## 1.3.1.6  Attributes of Pandas DataFrame
---

Similarly, *DataFrames* have several useful attributes, such as *columns*, *index*, *dtypes*, *shape*, and *size*. 

These attributes provide insights into the *structure* and *size* of the *DataFrame*. 

Let’s create a *DataFrame* and look at its attributes.



In [25]:
## Code snippet 1.3.1.6

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'City': ['New York', 'Los Angeles', 'Chicago']
})
print("Columns:", df.columns)
print("Index:", df.index)
print("\nData types:", df.dtypes)
print("\nShape:", df.shape)
print("Size:", df.size)

Columns: Index(['Name', 'Age', 'City'], dtype='object')
Index: RangeIndex(start=0, stop=3, step=1)

Data types: Name    object
Age      int64
City    object
dtype: object

Shape: (3, 3)
Size: 9
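
The `shape` and `size` attributes are related: `shape` gives (rows, columns), and `size` is the total number of cells. The following sketch rebuilds the same three-row table under the assumed name `df_small` and checks that relationship.

```python
import pandas as pd

# Same data as the example above, under a separate name
df_small = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [24, 27, 22],
    "City": ["New York", "Los Angeles", "Chicago"],
})

n_rows, n_cols = df_small.shape
print(n_rows, n_cols)                    # 3 3
print(df_small.size == n_rows * n_cols)  # True
print(len(df_small))                     # 3, the number of rows
print(list(df_small.columns))            # ['Name', 'Age', 'City']
```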


---
## 1.3.1.7  Basic Operations on DataFrames
---

*DataFrames* allow you to perform a variety of operations, such as 

*  selecting columns, 
*  filtering rows, 
*  adding or deleting columns, and 
*  generating summary statistics. 

These operations are essential for data manipulation and analysis. 

Let’s go through some basic operations on a DataFrame.

In [41]:
## Code snippet 1.3.1.7(a)

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
})

# Selecting a column
ages = df['Age']
print("Ages: \n", ages)

# Filtering rows
young_people = df[df['Age'] < 30]
print("\nYoung people:\n", young_people)


Ages: 
 0    24
1    27
2    22
3    32
4    29
Name: Age, dtype: int64

Young people:
       Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
4      Eve   29      Phoenix
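
Selection and filtering can also be combined. The sketch below, which rebuilds the same table under the assumed name `people`, selects two columns at once and filters on two conditions. Each condition needs its own parentheses, and `&` is used instead of Python’s `and`.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [24, 27, 22, 32, 29],
    "City": ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"],
})

# Selecting several columns: pass a list of column names
names_and_cities = people[["Name", "City"]]
print(names_and_cities.shape)  # (5, 2)

# Filtering on two conditions at once
mid_twenties = people[(people["Age"] >= 24) & (people["Age"] < 30)]
print(mid_twenties["Name"].tolist())  # ['Alice', 'Bob', 'Eve']
```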


In [44]:
## Code snippet 1.3.1.7(b)

# Adding a new column
df['Score'] = [85, 90, 95, 88, 92]
print("DataFrame with Score column added:\n", df)

# Deleting a column
df = df.drop(columns=['Score'])
print("\nDataFrame with Score column deleted:\n", df)


DataFrame with Score column added:
       Name  Age         City  Score
0    Alice   24     New York     85
1      Bob   27  Los Angeles     90
2  Charlie   22      Chicago     95
3    David   32      Houston     88
4      Eve   29      Phoenix     92

DataFrame with Score column deleted:
       Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
4      Eve   29      Phoenix


In [45]:
## Code snippet 1.3.1.7(c)

# Summary statistics
print("\nSummary statistics:\n", df.describe())


Summary statistics:
              Age
count   5.000000
mean   26.800000
std     3.962323
min    22.000000
25%    24.000000
50%    27.000000
75%    29.000000
max    32.000000
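
Each figure reported by `describe()` can also be computed on its own. As a brief sketch using the same ages (the `ages` name is chosen for this illustration), the individual methods give results that match the table above.

```python
import pandas as pd

ages = pd.Series([24, 27, 22, 32, 29], name="Age")

print("Mean:", ages.mean())      # 26.8
print("Median:", ages.median())  # 27.0
print("Min:", ages.min())        # 22
print("Max:", ages.max())        # 32
```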


In [52]:
## Code snippet 1.3.1.7(d)

data2007.info()


data2007.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     142 non-null    object 
 1   year        142 non-null    int64  
 2   population  142 non-null    int64  
 3   cont        142 non-null    object 
 4   life_exp    142 non-null    float64
 5   gdp_cap     142 non-null    float64
dtypes: float64(2), int64(2), object(2)
memory usage: 6.8+ KB


         year    population   life_exp       gdp_cap
count   142.0         142.0      142.0         142.0
mean   2007.0    44021220.0  67.007423   11680.07182
std       0.0   147621400.0  12.073021  12859.937337
min    2007.0      199579.0     39.613    277.551859
25%    2007.0     4508034.0   57.16025   1624.842248
50%    2007.0    10517530.0    71.9355   6124.371108
75%    2007.0    31210040.0   76.41325   18008.83564
max    2007.0  1318683000.0     82.603   49357.19017


---
## Summary
---

*  Pandas is essential for data manipulation in Python.
*  Series and DataFrames are the core data structures.
*  DataFrames can be created from dictionaries, CSV files, and other data sources.
*  Series and DataFrames have useful attributes and methods for data analysis.
*  Basic operations allow for effective data manipulation and analysis.