# The Pandas Library: Basic Functionality

Pandas is a Python library for data analysis and manipulation. It provides the data structures and functions needed to work with structured (tabular) data.

This notebook covers the basic functionality of pandas: creating, inspecting, selecting, filtering, modifying, aggregating, and saving data.

In [1]:
# Importing the pandas library
import os
import pandas as pd

# Show the version of pandas
print(f'The installed version of pandas is: {pd.__version__}')

# Show the current working directory
print(f'\nThe current working directory is: {os.getcwd()}')


The installed version of pandas is: 2.2.3

The current working directory is: /workspaces/applied_research_methods/Week_02



## Creating DataFrames and Series

Pandas has two primary data structures:

1. **Series** - 1-dimensional labeled array
2. **DataFrame** - 2-dimensional labeled data structure

Let's see how to create both.


In [2]:
# Creating a Series
data = [1, 2, 3, 4, 5]
series = pd.Series(data, name="Numbers")
series


0    1
1    2
2    3
3    4
4    5
Name: Numbers, dtype: int64

In [3]:
# Creating a Python dictionary with data
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"]
}

# Converting the dictionary to a DataFrame
df = pd.DataFrame(data)
df

      Name  Age           City
0    Alice   24       New York
1      Bob   27  San Francisco
2  Charlie   22    Los Angeles
3    David   32        Chicago
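
The relationship between the two structures can be sketched as follows: each column of a DataFrame is itself a Series, and a Series can carry custom labels instead of the default 0..n-1 index. The variable names `people`, `ages`, and `ages_by_name` are illustrative only and are not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample data mirroring the DataFrame above
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
})

# Each column of a DataFrame is a Series that shares the DataFrame's index
ages = people["Age"]
print(type(ages).__name__)   # Series
print(people.shape)          # (4, 2): 4 rows, 2 columns

# A Series can use custom labels instead of the default integer index
ages_by_name = pd.Series([24, 27, 22, 32], index=people["Name"], name="Age")
print(ages_by_name["Bob"])   # 27
```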



## Inspecting Data

Once you have a DataFrame, you might want to inspect it. Some key methods include:
- `head()`: View the first few rows of the DataFrame.
- `tail()`: View the last few rows of the DataFrame.
- `info()`: Get a summary of the DataFrame.
- `describe()`: Get a statistical summary of the numerical columns.


In [4]:
# Viewing first few rows
df.head()


      Name  Age           City
0    Alice   24       New York
1      Bob   27  San Francisco
2  Charlie   22    Los Angeles
3    David   32        Chicago


In [5]:
# Viewing last few rows
df.tail()


      Name  Age           City
0    Alice   24       New York
1      Bob   27  San Francisco
2  Charlie   22    Los Angeles
3    David   32        Chicago


In [6]:
# Summary information of the DataFrame
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes


In [7]:
# Statistical summary of numerical columns
df.describe()


             Age
count   4.000000
mean   26.250000
std     4.349329
min    22.000000
25%    23.500000
50%    25.500000
75%    28.250000
max    32.000000
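
Because the example DataFrame has only four rows, `head()` and `tail()` (which default to five rows) both return the whole table. A minimal sketch of a few further inspection options, assuming the same sample data rebuilt under the illustrative name `people`:

```python
import pandas as pd

# Hypothetical copy of the sample data used in this notebook
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"],
})

# head() and tail() accept the number of rows to return
first_two = people.head(2)
last_two = people.tail(2)

# Basic structural facts about the DataFrame
print(people.shape)             # (4, 3)
print(people.columns.tolist())  # ['Name', 'Age', 'City']

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
summary = people.describe(include="all")
print(summary)
```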



## Data Selection

You can select data in a DataFrame using column names, or by using slicing.

- Selecting a single column returns a Series.
- Selecting multiple columns returns a DataFrame.


In [8]:
# Selecting a single column
df['Name']


0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object

In [9]:
# Selecting multiple columns
df[['Name', 'Age']]


      Name  Age
0    Alice   24
1      Bob   27
2  Charlie   22
3    David   32
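
The difference between single and double brackets is easy to miss, so here is a small sketch. It assumes the same sample data under the illustrative name `people`; passing a list of names always yields a DataFrame, even when the list holds a single column.

```python
import pandas as pd

# Hypothetical copy of the sample data used in this notebook
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"],
})

name_series = people["Name"]          # single brackets: Series
name_frame = people[["Name"]]         # list with one name: one-column DataFrame
reordered = people[["City", "Name"]]  # columns come back in the order requested

print(type(name_series).__name__)  # Series
print(type(name_frame).__name__)   # DataFrame
print(reordered.columns.tolist())  # ['City', 'Name']
```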



### Selecting Rows

Pandas provides two main ways to select rows:
- **loc**: Select by index label (it also accepts a boolean mask).
- **iloc**: Select by integer position (0-based).


In [10]:
# Selecting rows with loc using a boolean mask (rows where Name is 'Alice')
df.loc[df['Name'] == 'Alice']


    Name  Age      City
0  Alice   24  New York


In [11]:
# Selecting rows by index position (iloc)
df.iloc[2]  # Third row


Name        Charlie
Age              22
City    Los Angeles
Name: 2, dtype: object
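
With the default index, labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a hypothetical DataFrame whose index starts at 1 (as happens once a row has been dropped) to make the distinction visible. All variable names are illustrative.

```python
import pandas as pd

# Hypothetical data with a non-default index: labels 1, 2, 3
people = pd.DataFrame(
    {"Name": ["Bob", "Charlie", "David"], "Age": [27, 22, 32]},
    index=[1, 2, 3],
)

by_label = people.loc[2]       # row whose index label is 2 (Charlie)
by_position = people.iloc[2]   # third row by position (David)

label_slice = people.loc[1:2]  # label slices include both endpoints
pos_slice = people.iloc[0:2]   # positional slices exclude the stop

# loc also takes a boolean mask plus a column selection
older_names = people.loc[people["Age"] > 25, ["Name"]]

print(by_label["Name"])     # Charlie
print(by_position["Name"])  # David
```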


## Filtering Data

Filtering data in a DataFrame is done using boolean conditions.


In [12]:
# Filter rows where Age > 25
df[df['Age'] > 25]


    Name  Age           City
1    Bob   27  San Francisco
3  David   32        Chicago
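
Conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses or variable, because these operators bind more tightly than comparisons. A minimal sketch, assuming the same sample data under the illustrative name `people`:

```python
import pandas as pd

# Hypothetical copy of the sample data used in this notebook
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"],
})

over_25 = people["Age"] > 25
in_california = people["City"].isin(["San Francisco", "Los Angeles"])

both = people[over_25 & in_california]    # Bob
either = people[over_25 | in_california]  # Bob, Charlie, David
not_over_25 = people[~over_25]            # Alice, Charlie

# query() expresses the same kind of filter as a string
chicago_over_25 = people.query("Age > 25 and City == 'Chicago'")  # David
```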



## Adding/Modifying Columns

New columns can be added to a DataFrame, or existing columns can be modified.


In [13]:
# Adding a new column 'Salary'
df['Salary'] = [70000, 80000, 50000, 120000]
df


      Name  Age           City  Salary
0    Alice   24       New York   70000
1      Bob   27  San Francisco   80000
2  Charlie   22    Los Angeles   50000
3    David   32        Chicago  120000


In [14]:
# Modifying an existing column
df['Age'] = df['Age'] + 1  # Incrementing each age by 1
df


      Name  Age           City  Salary
0    Alice   25       New York   70000
1      Bob   28  San Francisco   80000
2  Charlie   23    Los Angeles   50000
3    David   33        Chicago  120000
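
Columns can also be derived from other columns or changed for only some rows. The sketch below is illustrative: the `Bonus` and `Senior` columns, the 10% rate, and the new salary for Bob are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical copy of the sample data, including the Salary column
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
    "Salary": [70000, 80000, 50000, 120000],
})

# assign() returns a new DataFrame and leaves the original unchanged
with_bonus = people.assign(Bonus=people["Salary"] / 10)

# A boolean column derived from a comparison
people["Senior"] = people["Age"] >= 30

# Modify a value for the rows matching a condition
people.loc[people["Name"] == "Bob", "Salary"] = 85000

print(with_bonus)
print(people)
```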



## Dropping Columns or Rows

You can remove columns or rows using the `drop` method. Passing `inplace=True` modifies the DataFrame directly instead of returning a new one.


In [15]:
# Dropping the 'Salary' column
df.drop('Salary', axis=1, inplace=True)
df


      Name  Age           City
0    Alice   25       New York
1      Bob   28  San Francisco
2  Charlie   23    Los Angeles
3    David   33        Chicago


In [16]:
# Dropping a row by index
df.drop(0, axis=0, inplace=True)  # Dropping the first row
df


      Name  Age           City
1      Bob   28  San Francisco
2  Charlie   23    Los Angeles
3    David   33        Chicago
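
Note that the remaining rows keep their original labels (1, 2, 3) after the drop. As an alternative to `inplace=True`, `drop` can return a new DataFrame, and `reset_index(drop=True)` renumbers the rows from 0. A minimal sketch with illustrative variable names:

```python
import pandas as pd

# Hypothetical copy of the sample data used in this notebook
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 28, 23, 33],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"],
})

# Without inplace=True, drop returns a new DataFrame; `people` is unchanged
without_city = people.drop(columns="City")
without_first = people.drop(index=0)

# The dropped label leaves a gap; reset_index(drop=True) renumbers from 0
renumbered = without_first.reset_index(drop=True)

print(without_first.index.tolist())  # [1, 2, 3]
print(renumbered.index.tolist())     # [0, 1, 2]
```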



## Data Aggregation

Pandas provides many functions to perform data aggregation, like `sum`, `mean`, `min`, `max`, etc.


In [17]:
# Calculating the mean age
df['Age'].mean()


np.float64(28.0)
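
Several aggregations can be computed in one call with `agg`. The sketch below assumes a standalone Series named `ages` holding the three ages left at this point in the notebook.

```python
import pandas as pd

# Hypothetical Series holding the three remaining ages
ages = pd.Series([28, 23, 33], name="Age")

print(ages.sum())  # 84
print(ages.min())  # 23
print(ages.max())  # 33

# agg() applies several aggregation functions at once and returns a Series
stats = ages.agg(["mean", "min", "max"])
print(stats)
```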


## Grouping Data

Using `groupby`, you can group data by a specific column and perform aggregate functions.


In [18]:
# Grouping by City and calculating mean age for each city
df.groupby('City')['Age'].mean()


City
Chicago          33.0
Los Angeles      23.0
San Francisco    28.0
Name: Age, dtype: float64
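
In the example above every city appears only once, so each group mean equals the single value in that group. Grouping is more informative when groups contain several rows. The sketch below uses a made-up `staff` DataFrame (all values are hypothetical) and named aggregation to compute several statistics per group.

```python
import pandas as pd

# Hypothetical data in which each city appears more than once
staff = pd.DataFrame({
    "City": ["Chicago", "Chicago", "Boston", "Boston", "Boston"],
    "Age": [30, 40, 20, 25, 30],
    "Salary": [100, 200, 50, 60, 70],
})

# Mean age per city
mean_age = staff.groupby("City")["Age"].mean()

# Named aggregation: output_column=(input_column, function)
summary = staff.groupby("City").agg(
    mean_age=("Age", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Age", "count"),
)
print(summary)
```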


## Handling Missing Data

Pandas provides methods to handle missing data, like `fillna` and `dropna`.


In [19]:
# Adding missing values to demonstrate handling
df.loc[2, 'Age'] = None  # Setting Age in the row labeled 2 (Charlie) to NaN
df


      Name   Age           City
1      Bob  28.0  San Francisco
2  Charlie   NaN    Los Angeles
3    David  33.0        Chicago


In [20]:
# Handling missing values by filling NaN with 0
df['Age'] = df['Age'].fillna(0)
df

      Name   Age           City
1      Bob  28.0  San Francisco
2  Charlie   0.0    Los Angeles
3    David  33.0        Chicago


In [21]:
# Dropping rows with missing values (no effect here: the NaN was already filled above)
df.dropna(inplace=True)
df


      Name   Age           City
1      Bob  28.0  San Francisco
2  Charlie   0.0    Los Angeles
3    David  33.0        Chicago
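
Filling a missing age with 0 pulls the mean age down, so the choice of fill value matters. The sketch below shows how missing values might be counted first, then either filled with the column mean or dropped. It rebuilds the data under the illustrative name `people` so that both options start from the same NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of the data with one missing age
people = pd.DataFrame({
    "Name": ["Bob", "Charlie", "David"],
    "Age": [28.0, np.nan, 33.0],
})

# Count missing values per column before deciding how to handle them
missing_per_column = people.isna().sum()
print(missing_per_column)

# Option 1: fill with the mean of the observed values (NaN is skipped by mean)
filled_with_mean = people["Age"].fillna(people["Age"].mean())

# Option 2: drop only the rows where Age is missing
complete_rows = people.dropna(subset=["Age"])

print(filled_with_mean.tolist())      # [28.0, 30.5, 33.0]
print(complete_rows["Name"].tolist()) # ['Bob', 'David']
```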



## Saving and Loading Data

Pandas can easily save DataFrames to files and read from files. Common formats are CSV, Excel, and JSON.


In [22]:
# Saving to a CSV file
df.to_csv("sample_data.csv", index=False)


In [23]:

# Loading from a CSV file
loaded_df = pd.read_csv("sample_data.csv")
loaded_df


      Name   Age           City
0      Bob  28.0  San Francisco
1  Charlie   0.0    Los Angeles
2    David  33.0        Chicago
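
Because the file was written with `index=False`, the original labels (1, 2, 3) are not stored and the loaded DataFrame gets a fresh 0-based index. The sketch below illustrates what happens when the index is written instead, and shows a JSON round trip. It uses in-memory text via `io.StringIO` rather than files, and the data and variable names are illustrative.

```python
import io

import pandas as pd

# Hypothetical data with a non-default index
people = pd.DataFrame({"Name": ["Bob", "David"], "Age": [28, 33]}, index=[1, 3])

# to_csv() without a path returns the CSV text; the index is written by default
csv_text = people.to_csv()

# Read back naively: the index becomes an extra column named 'Unnamed: 0'
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded.columns.tolist())  # ['Unnamed: 0', 'Name', 'Age']

# index_col=0 restores the first column as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.index.tolist())    # [1, 3]

# JSON round trip, one record per row (the index is not stored)
json_text = people.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))
print(from_json)
```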



## Conclusion

This notebook covered the basics of pandas, including data creation, inspection, selection, filtering, modification, aggregation, handling missing data, and saving/loading data.

### Jupyter notebook --footer info-- (please always provide this at the end of each submitted notebook)

In [24]:
import os
import platform
import socket
from platform import python_version
from datetime import datetime

print('-----------------------------------')
print(os.name.upper())
print(platform.system(), '|', platform.release())
print('Datetime:', datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print('Python Version:', python_version())
print('-----------------------------------')

-----------------------------------
POSIX
Linux | 6.5.0-1025-azure
Datetime: 2024-11-08 09:58:41
Python Version: 3.11.10
-----------------------------------
