
# [Sample Notebook] AfterWork: Pandas Challenge - Day 24

## Prerequisites

In [28]:
# Import pandas for data manipulation and NumPy for numerical operations
import pandas as pd
import numpy as np

## 1. Working with datetime64[ns] Data Type

We work with the datetime64[ns] data type in Pandas when dealing with date and time information. This data type allows us to efficiently store and manipulate date and time values in our DataFrame. For example, in a healthcare setting, we can use the datetime64[ns] data type to analyze patient admission and discharge times, track medication administration schedules, and monitor appointment booking trends.

To work with the datetime64[ns] data type, we can convert existing columns with the pd.to_datetime() function, extract components such as year, month, day, hour, minute, and second with the dt accessor, and perform date arithmetic to calculate time differences or create new date columns.
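
The three steps described above can be sketched on a small made-up DataFrame. The `visits` frame and its column names are hypothetical and are not part of the course datasets.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
visits = pd.DataFrame({
    'admitted': ['2021-05-10', '2021-04-20'],
    'discharged': ['2021-05-15', '2021-04-25'],
})

# 1. Convert the string columns to a datetime dtype
visits['admitted'] = pd.to_datetime(visits['admitted'])
visits['discharged'] = pd.to_datetime(visits['discharged'])

# 2. Extract a component with the dt accessor
visits['admit_month'] = visits['admitted'].dt.month

# 3. Date arithmetic: subtracting two datetime columns gives a timedelta,
#    from which the number of whole days can be read
visits['stay_days'] = (visits['discharged'] - visits['admitted']).dt.days

print(visits)
```

Here `admit_month` holds the month number of each admission and `stay_days` the length of each stay in days.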



In [2]:
# Load the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/e/patients_t5o08.csv')
df.head(5)

Unnamed: 0,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Admission Date,Discharge Date,Follow-up Date
0,P001,John,Doe,45,Male,Hypertension,Medication,2021-05-10,2021-05-15,2021-06-01
1,P002,Jane,Smith,32,Female,Diabetes,Diet & Exercise,2021-04-20,2021-04-25,2021-05-10
2,P003,Michael,Johnson,50,Male,Arthritis,Physical Therapy,2021-06-05,2021-06-15,2021-07-01
3,P004,Sarah,Williams,28,Female,Anxiety,Counseling,2021-03-15,2021-03-20,2021-04-05
4,P005,David,Brown,60,Male,Heart Disease,Medication & Monitoring,2021-07-10,2021-07-20,2021-08-05


In [3]:
# Check the column data types
df.dtypes

Patient ID        object
First Name        object
Last Name         object
Age                int64
Gender            object
Diagnosis         object
Treatment         object
Admission Date    object
Discharge Date    object
Follow-up Date    object
dtype: object

In [5]:
# Convert 'Admission Date', 'Discharge Date', and 'Follow-up Date' columns to datetime64[ns]
# (format='mixed' requires pandas 2.0 or later; it infers the format of each value individually)
df['Admission Date'] = pd.to_datetime(df['Admission Date'], format='mixed')
df['Discharge Date'] = pd.to_datetime(df['Discharge Date'], format='mixed')
df['Follow-up Date'] = pd.to_datetime(df['Follow-up Date'], format='mixed')

# Preview the DataFrame after the conversion
df.head(5)

Unnamed: 0,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Admission Date,Discharge Date,Follow-up Date
0,P001,John,Doe,45,Male,Hypertension,Medication,2021-05-10,2021-05-15,2021-06-01
1,P002,Jane,Smith,32,Female,Diabetes,Diet & Exercise,2021-04-20,2021-04-25,2021-05-10
2,P003,Michael,Johnson,50,Male,Arthritis,Physical Therapy,2021-06-05,2021-06-15,2021-07-01
3,P004,Sarah,Williams,28,Female,Anxiety,Counseling,2021-03-15,2021-03-20,2021-04-05
4,P005,David,Brown,60,Male,Heart Disease,Medication & Monitoring,2021-07-10,2021-07-20,2021-08-05


### <font color="green">Challenge</font>

Given the dataset of clinics with information on their opening and closing times, your task is to create a new column named 'Operating Hours' that calculates the total operating hours of each clinic per day. The 'Operating Hours' column should display the time duration in hours and minutes. Use the datetime64[ns] data type and relevant functions to perform this task. You can access the dataset from the URL: https://afterwork.ai/ds/ch/clinics_k2a6w.csv


In [6]:
# Load the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/ch/clinics_k2a6w.csv')
df.head()

Unnamed: 0,Clinic ID,Clinic Name,Specialty,Location,Doctor Name,Doctor Email,Contact Number,Opening Time,Closing Time,Date Established
0,101,City Health Clinic,General Medicine,New York,Dr. Smith,drsmith@example.com,123-456-7890,08:00:00,17:00:00,2005-06-15
1,102,Sunrise Medical Center,Pediatrics,Los Angeles,Dr. Johnson,drjohnson@example.com,234-567-8901,09:00:00,18:00:00,2010-03-20
2,103,Green Valley Family Clinic,Family Medicine,Chicago,Dr. Brown,drbrown@example.com,345-678-9012,07:30:00,16:30:00,2008-11-10
3,104,Ocean View Dental Clinic,Dentistry,Miami,Dr. Lee,drlee@example.com,456-789-0123,08:30:00,17:30:00,2015-09-25
4,105,Elite Orthopedic Center,Orthopedics,San Francisco,Dr. Martinez,drmartinez@example.com,567-890-1234,10:00:00,19:00:00,2012-07-05


In [7]:
# Check the column data types
df.dtypes


Clinic ID            int64
Clinic Name         object
Specialty           object
Location            object
Doctor Name         object
Doctor Email        object
Contact Number      object
Opening Time        object
Closing Time        object
Date Established    object
dtype: object

In [18]:
# Convert 'Opening Time', 'Closing Time', and 'Date Established' columns to datetime64[ns]
# (time-only strings are given the default date 1900-01-01)
df['Opening Time'] = pd.to_datetime(df['Opening Time'], format='%H:%M:%S')
df['Closing Time'] = pd.to_datetime(df['Closing Time'], format='%H:%M:%S')
df['Date Established'] = pd.to_datetime(df['Date Established'], format='mixed')

# Check the column data types
df.dtypes


Clinic ID                     int64
Clinic Name                  object
Specialty                    object
Location                     object
Doctor Name                  object
Doctor Email                 object
Contact Number               object
Opening Time         datetime64[ns]
Closing Time         datetime64[ns]
Date Established     datetime64[ns]
Operating Hours     timedelta64[ns]
dtype: object

In [19]:
# Calculate the operating hours for each clinic per day
df['Operating Hours'] = df['Closing Time'] - df['Opening Time']

# Preview the DataFrame with the new 'Operating Hours' column
df.head(5)


Unnamed: 0,Clinic ID,Clinic Name,Specialty,Location,Doctor Name,Doctor Email,Contact Number,Opening Time,Closing Time,Date Established,Operating Hours
0,101,City Health Clinic,General Medicine,New York,Dr. Smith,drsmith@example.com,123-456-7890,1900-01-01 08:00:00,1900-01-01 17:00:00,2005-06-15,0 days 09:00:00
1,102,Sunrise Medical Center,Pediatrics,Los Angeles,Dr. Johnson,drjohnson@example.com,234-567-8901,1900-01-01 09:00:00,1900-01-01 18:00:00,2010-03-20,0 days 09:00:00
2,103,Green Valley Family Clinic,Family Medicine,Chicago,Dr. Brown,drbrown@example.com,345-678-9012,1900-01-01 07:30:00,1900-01-01 16:30:00,2008-11-10,0 days 09:00:00
3,104,Ocean View Dental Clinic,Dentistry,Miami,Dr. Lee,drlee@example.com,456-789-0123,1900-01-01 08:30:00,1900-01-01 17:30:00,2015-09-25,0 days 09:00:00
4,105,Elite Orthopedic Center,Orthopedics,San Francisco,Dr. Martinez,drmartinez@example.com,567-890-1234,1900-01-01 10:00:00,1900-01-01 19:00:00,2012-07-05,0 days 09:00:00
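
The challenge asks for hours and minutes, whereas the subtraction above yields a timedelta displayed as '0 days 09:00:00'. One possible way to turn such a value into an hours-and-minutes label is sketched below on a hypothetical two-row sample. The 'h'/'m' label format is an assumption, and the sketch assumes each clinic closes after it opens on the same day.

```python
import pandas as pd

# Hypothetical sample mirroring the 'Opening Time' / 'Closing Time' columns
clinics = pd.DataFrame({
    'Opening Time': ['08:00:00', '07:30:00'],
    'Closing Time': ['17:00:00', '16:45:00'],
})

opening = pd.to_datetime(clinics['Opening Time'], format='%H:%M:%S')
closing = pd.to_datetime(clinics['Closing Time'], format='%H:%M:%S')
duration = closing - opening  # Series of timedeltas

# Split each duration into whole hours and leftover minutes
total_minutes = duration.dt.total_seconds() // 60
hours = (total_minutes // 60).astype(int)
minutes = (total_minutes % 60).astype(int)

# Build a label such as '9h 15m'
clinics['Operating Hours'] = (
    hours.astype(str) + 'h ' + minutes.astype(str).str.zfill(2) + 'm'
)
print(clinics)
```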


## 2. Using Time as the Index in Pandas DataFrames

We use time as the index in Pandas DataFrames to organize and access time-series data efficiently. By setting the time column as the index, we can easily perform time-based operations, such as resampling, slicing, and plotting. For example, in healthcare, we can analyze patient vital signs over time by setting the timestamp as the index.

To apply this concept, we first ensure the time column is in datetime format using pd.to_datetime(). Then, we set this column as the index using set_index('time_column'). This allows us to directly access and manipulate time-series data based on specific time intervals.



In [20]:
# Load the dataset from the URL
df = pd.read_csv('https://afterwork.ai/ds/e/patients_idfug.csv')
df.head(2)

Unnamed: 0,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Admission Date,Discharge Date,Doctor
0,1001,John,Doe,45,Male,Heart Disease,Medication,2020-01-05,2020-01-15,Dr. Smith
1,1002,Jane,Smith,32,Female,Diabetes,Diet Control,2019-11-10,2019-11-20,Dr. Johnson


In [21]:
# Convert 'Admission Date' column to datetime format
df['Admission Date'] = pd.to_datetime(df['Admission Date'])

# Set 'Admission Date' as the index of the DataFrame
df.set_index('Admission Date', inplace=True)

# Previewing the DataFrame with 'Admission Date' as the index
df.head()

Admission Date,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Discharge Date,Doctor
2020-01-05,1001,John,Doe,45,Male,Heart Disease,Medication,2020-01-15,Dr. Smith
2019-11-10,1002,Jane,Smith,32,Female,Diabetes,Diet Control,2019-11-20,Dr. Johnson
2020-03-01,1003,Michael,Johnson,50,Male,High Blood Pressure,Exercise,2020-03-10,Dr. Brown
2019-08-15,1004,Sarah,Williams,28,Female,Anxiety,Counseling,2019-08-25,Dr. Lee
2020-05-10,1005,David,Anderson,60,Male,Arthritis,Physical Therapy,2020-05-20,Dr. White
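
With a datetime index in place, the time-based operations mentioned above (slicing and resampling) become short expressions. The following is a minimal sketch on a hypothetical `admissions` frame, not on the course dataset.

```python
import pandas as pd

# Hypothetical admissions indexed by date
admissions = pd.DataFrame(
    {'Patient ID': [1001, 1002, 1003, 1004]},
    index=pd.to_datetime(['2020-01-05', '2019-11-10', '2020-03-01', '2019-08-15']),
)
admissions.index.name = 'Admission Date'

# Sorting the index first makes date-range slicing reliable
admissions = admissions.sort_index()

# Partial-string indexing: all rows from the year 2020
in_2020 = admissions.loc['2020']

# Slice between two dates (both endpoints are included)
late_2019 = admissions.loc['2019-08-01':'2019-12-31']

# Resample: number of admissions per calendar year
per_year = admissions['Patient ID'].resample('YS').count()
print(per_year)
```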


## 3. Using map() for Column Transformations

We use the map() function in Pandas to apply a transformation to each element in a specific column of a DataFrame. By using map(), we can efficiently update values in a column based on a predefined mapping or function. For example, we can convert categorical values to numerical values, standardize units of measurement, or apply custom calculations to each element in a column.

A real-life use case for using map() is when we need to convert a column of string values representing different categories into corresponding numerical codes for machine learning models.

To apply map() for column transformations, we first define a mapping dictionary or a function that specifies how each value in the column should be transformed. Then, we use the map() function on the desired column, passing the mapping dictionary or function as an argument to perform the transformation.



In [None]:
# Load the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/e/patients_27ihw.csv')
df.head()

In [22]:
# Define a mapping dictionary to convert Gender to numerical values
gender_mapping = {'Male': 0, 'Female': 1}

# Apply the map() function to transform the 'Gender' column using the mapping dictionary
df['Gender'] = df['Gender'].map(gender_mapping)

# Preview the updated DataFrame
df.head()

Admission Date,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Discharge Date,Doctor
2020-01-05,1001,John,Doe,45,0,Heart Disease,Medication,2020-01-15,Dr. Smith
2019-11-10,1002,Jane,Smith,32,1,Diabetes,Diet Control,2019-11-20,Dr. Johnson
2020-03-01,1003,Michael,Johnson,50,0,High Blood Pressure,Exercise,2020-03-10,Dr. Brown
2019-08-15,1004,Sarah,Williams,28,1,Anxiety,Counseling,2019-08-25,Dr. Lee
2020-05-10,1005,David,Anderson,60,0,Arthritis,Physical Therapy,2020-05-20,Dr. White
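
One behaviour the dictionary mapping above does not show: any value that is missing from the dictionary becomes NaN. The sketch below illustrates this on a hypothetical Series, along with the function form of map().

```python
import pandas as pd

# Hypothetical column containing a value the dictionary does not cover
gender = pd.Series(['Male', 'Female', 'Other', 'Female'])

gender_mapping = {'Male': 0, 'Female': 1}

# Values missing from the dictionary become NaN
encoded = gender.map(gender_mapping)

# map() also accepts a function, applied element by element
initials = gender.map(lambda value: value[0])

print(encoded.isna().sum())
print(initials.tolist())
```

Checking `encoded.isna().sum()` after a mapping is a quick way to spot categories that the dictionary forgot.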


### <font color="green">Challenge</font>

In the dataset provided at the URL: https://afterwork.ai/ds/ch/clinics_gvu57.csv, there is a column named 'Doctor Title' that contains different titles such as 'General Practitioner', 'Pediatrician', 'Dentist', etc. Your task is to create a Python code snippet that uses the map() function to encode these titles into numerical codes. Define a mapping dictionary that assigns a unique numerical code to each title and apply this transformation to the 'Doctor Title' column.


In [30]:
# Load the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/ch/clinics_gvu57.csv')

# Collect the distinct titles found in the 'Doctor Title' column
unique_titles = df['Doctor Title'].unique()

In [32]:
# Sort the distinct titles so the codes are assigned in a reproducible order
unique_titles = np.unique(df['Doctor Title'])

# Define a mapping dictionary that assigns a numerical code to each title
title_to_code = {title: code for code, title in enumerate(unique_titles)}

# Apply the map() function to transform the 'Doctor Title' column using the mapping dictionary
df['Doctor Title'] = df['Doctor Title'].map(title_to_code)

# Preview the updated DataFrame
df.head(2)

Unnamed: 0,Clinic ID,Clinic Name,Location,Specialty,Doctor Name,Doctor Title,Doctor Email,Contact Number,Opening Hours,Rating
0,101,City Health Clinic,Downtown,General Medicine,Dr. Smith,13,drsmith@example.com,123-456-7890,8am-5pm,4.5
1,102,Sunrise Medical Center,Uptown,Pediatrics,Dr. Johnson,29,drjohnson@example.com,234-567-8901,9am-6pm,4.2


## 4. Manipulating the Index with Resetting and Setting in Pandas

We manipulate the index of a DataFrame by resetting and setting it using the Pandas library. Resetting the index means converting the current index into a column and replacing it with a default integer index. This can be useful when we want to remove the current index or when we need to reset the index after filtering or sorting operations.

On the other hand, setting the index allows us to specify a column or a combination of columns as the new index of the DataFrame. This is helpful when we want to organize the data based on specific columns or when we need to perform operations that require a unique identifier as the index. For example, in a healthcare dataset, we can reset the index to default integers after filtering out irrelevant rows, and then set the 'Patient ID' column as the new index to easily access and analyze patient-specific data.



In [33]:
# Loading the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/e/patients_zsu3g.csv')

# Previewing the original DataFrame
df.head()

Unnamed: 0,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Admission Date,Discharge Date,Room Number
0,101,Emily,Smith,45,Female,Hypertension,Medication,2021-05-10,2021-05-15,101
1,102,James,Johnson,60,Male,Diabetes,Diet and Exercise,2021-06-02,2021-06-10,102
2,103,Sarah,Williams,35,Female,Anxiety,Therapy,2021-07-15,2021-07-20,103
3,104,Michael,Brown,50,Male,Arthritis,Physical Therapy,2021-08-03,2021-08-15,104
4,105,Linda,Anderson,55,Female,Depression,Medication and Counseling,2021-09-20,2021-09-30,105


In [34]:
# Resetting the index to default integers (the old index is kept as a new 'index' column)
df_reset = df.reset_index()

# Previewing the DataFrame after resetting the index
df_reset.head()

Unnamed: 0,index,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Admission Date,Discharge Date,Room Number
0,0,101,Emily,Smith,45,Female,Hypertension,Medication,2021-05-10,2021-05-15,101
1,1,102,James,Johnson,60,Male,Diabetes,Diet and Exercise,2021-06-02,2021-06-10,102
2,2,103,Sarah,Williams,35,Female,Anxiety,Therapy,2021-07-15,2021-07-20,103
3,3,104,Michael,Brown,50,Male,Arthritis,Physical Therapy,2021-08-03,2021-08-15,104
4,4,105,Linda,Anderson,55,Female,Depression,Medication and Counseling,2021-09-20,2021-09-30,105


In [35]:
# Setting the 'Patient ID' column as the new index
df_set = df.set_index('Patient ID')

# Previewing the DataFrame after setting the 'Patient ID' as the index
df_set.head()

Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Admission Date,Discharge Date,Room Number
101,Emily,Smith,45,Female,Hypertension,Medication,2021-05-10,2021-05-15,101
102,James,Johnson,60,Male,Diabetes,Diet and Exercise,2021-06-02,2021-06-10,102
103,Sarah,Williams,35,Female,Anxiety,Therapy,2021-07-15,2021-07-20,103
104,Michael,Brown,50,Male,Arthritis,Physical Therapy,2021-08-03,2021-08-15,104
105,Linda,Anderson,55,Female,Depression,Medication and Counseling,2021-09-20,2021-09-30,105
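
The filter-then-reindex scenario described at the start of this section can be sketched as follows. The `patients` frame is hypothetical. Passing drop=True to reset_index() discards the old labels instead of adding them as an 'index' column.

```python
import pandas as pd

# Hypothetical sample data
patients = pd.DataFrame({
    'Patient ID': [101, 102, 103, 104],
    'Age': [45, 60, 35, 50],
})

# Filtering keeps the original row labels (here 0, 1 and 3)
over_40 = patients[patients['Age'] > 40]

# drop=True discards the old labels instead of adding an 'index' column
over_40 = over_40.reset_index(drop=True)

# Set 'Patient ID' as the index, then look a patient up by ID
by_id = over_40.set_index('Patient ID')
age_of_102 = by_id.loc[102, 'Age']

# reset_index() moves 'Patient ID' back into a regular column
restored = by_id.reset_index()
print(restored)
```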


## 5. Chunking Large Datasets for Efficient Processing

We chunk large datasets into smaller, more manageable pieces to keep memory usage under control. By breaking the dataset into chunks, we can operate on each portion individually, so only one chunk has to be held in memory at a time. For example, in healthcare analytics, we may have a massive dataset of patient records that we need to analyze. By chunking the data, we can process one segment at a time, allowing us to handle large volumes of information without overwhelming the system.

To apply chunking in Pandas, we can use the 'chunksize' parameter when reading a large dataset with 'pd.read_csv()'. With 'chunksize' set, 'pd.read_csv()' returns an iterator that yields DataFrames of at most that many rows each.
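
Chunking is most useful when each chunk contributes to a running result. The sketch below computes an average age chunk by chunk, using a small in-memory CSV (hypothetical data) as a stand-in for a large file.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a large file (hypothetical data)
csv_text = "Patient ID,Age\n1001,45\n1002,32\n1003,60\n1004,28\n1005,50\n"

total_age = 0
row_count = 0

# chunksize=2 yields DataFrames of at most 2 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_age += chunk['Age'].sum()
    row_count += len(chunk)

# Combine the running totals once all chunks have been read
average_age = total_age / row_count
print(average_age)
```

Only the running totals are kept between iterations, so memory use does not grow with the size of the file.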



In [36]:
# Define the URL for the dataset
url = 'https://afterwork.ai/ds/e/patients_v5wmh.csv'

# Specify the chunk size for processing
chunk_size = 3

# Read the dataset in chunks using pd.read_csv()
for chunk in pd.read_csv(url, chunksize=chunk_size):
    # Perform operations on each chunk
    print(chunk)

   Patient ID First Name Last Name  Age  Gender      Diagnosis   Treatment  \
0        1001       John       Doe   45    Male  Heart Disease  Medication   
1        1002       Jane     Smith   32  Female      Influenza        Rest   
2        1003    Michael   Johnson   60    Male       Diabetes        Diet   

  Admission Date Discharge Date       Doctor  
0     2020-05-10     2020-05-20    Dr. Smith  
1     2020-06-15     2020-06-20  Dr. Johnson  
2     2020-07-01     2020-07-10    Dr. Brown  
   Patient ID First Name Last Name  Age  Gender            Diagnosis  \
3        1004      Sarah  Williams   28  Female        Fractured Leg   
4        1005      David  Anderson   50    Male  High Blood Pressure   
5        1006      Emily     Brown   42  Female              Anxiety   

    Treatment Admission Date Discharge Date      Doctor  
3     Surgery     2020-08-05     2020-08-15     Dr. Lee  
4    Exercise     2020-09-10     2020-09-20  Dr. Wilson  
5  Counseling     2020-10-15     202

### <font color="green">Challenge</font>

Write a Python code snippet that reads the dataset from the following URL: https://afterwork.ai/ds/ch/clinics_ul0qi.csv and displays the chunks of data without performing any operations. Make use of the 'chunksize' parameter in 'pd.read_csv()' to chunk the dataset.


In [39]:
# Define the URL for the dataset
url = 'https://afterwork.ai/ds/ch/clinics_ul0qi.csv'

# Specify the chunk size for processing
chunk_size = 10

# Read the dataset in chunks using pd.read_csv()
for chunk in pd.read_csv(url, chunksize=chunk_size):
    # Display each chunk without further processing
    print(chunk)


   Clinic ID      Clinic Name       Location         Specialty  Doctors  \
0        101   Central Clinic       New York  General Medicine        5   
1        102  Westside Clinic    Los Angeles       Dermatology        3   
2        103  East End Clinic        Chicago        Pediatrics        4   
3        104     South Clinic          Miami        Cardiology        6   
4        105     North Clinic        Seattle       Orthopedics        5   
5        106  Downtown Clinic  San Francisco        Obstetrics        4   
6        107    Uptown Clinic         Boston     Ophthalmology        3   
7        108  Suburban Clinic         Dallas               ENT        5   
8        109     Rural Clinic         Denver           Urology        2   
9        110     Metro Clinic        Phoenix        Psychiatry        4   

   Nurses  Patients  Appointments  Revenue  Rating  
0      10       200           500   100000     4.5  
1       8       150           400    90000     4.2  
2       6      

## 6. Counting Distinct Elements in a DataFrame

We count distinct elements in a DataFrame to determine the unique values present in a specific column. This helps us understand the diversity and uniqueness of data entries. For example, in a healthcare dataset, we may want to count the distinct patient IDs to identify the total number of unique patients.

To apply this concept, we call the nunique() method on a specific column. This method returns the number of unique elements in that column and, by default, ignores missing values. Counting distinct elements shows how varied a column is and can reveal duplicates, for example when the number of distinct patient IDs is smaller than the number of rows.

In [40]:
# Load the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/e/patients_6l0x9.csv')
df.head()

Unnamed: 0,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Admission Date,Discharge Date,Doctor
0,1001,John,Doe,45,Male,Heart Disease,Medication,2020-05-10,2020-05-20,Dr. Smith
1,1002,Jane,Smith,32,Female,Flu,Rest,2020-06-15,2020-06-20,Dr. Johnson
2,1003,Michael,Johnson,50,Male,Diabetes,Diet,2020-07-01,2020-07-10,Dr. Brown
3,1004,Sarah,Williams,28,Female,Allergy,Medication,2020-08-05,2020-08-15,Dr. Lee
4,1005,David,Anderson,60,Male,Arthritis,Physical Therapy,2020-09-10,2020-09-25,Dr. White


In [41]:
# Counting distinct elements in the 'Patient ID' column
distinct_patient_count = df['Patient ID'].nunique()

# Display the count of distinct patients
print('Total number of distinct patients:', distinct_patient_count)

Total number of distinct patients: 75
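
A few variations on nunique() are sketched below on a hypothetical `records` frame: the default handling of missing values, the dropna parameter, and the DataFrame-level call.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a repeated ID and a missing doctor
records = pd.DataFrame({
    'Patient ID': [1001, 1002, 1002, 1003],
    'Doctor': ['Dr. Smith', 'Dr. Lee', 'Dr. Lee', np.nan],
})

# Distinct values in one column
distinct_patients = records['Patient ID'].nunique()

# Missing values are ignored by default; dropna=False counts NaN as a value
doctors_default = records['Doctor'].nunique()
doctors_with_nan = records['Doctor'].nunique(dropna=False)

# Called on the DataFrame, nunique() returns one count per column
per_column = records.nunique()
print(per_column)
```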


## 7. Optimizing Data Processing with eval()

We can streamline data processing with the eval() method in Pandas. This method evaluates an expression supplied as a string, with the DataFrame's column names available as variables. On large datasets, eval() can speed up arithmetic and comparison expressions, particularly when the optional numexpr library is installed; on small datasets such as the ones used here, the main benefit is more compact code. For example, we can use eval() to filter rows, transform values, or calculate new columns without writing multiple intermediate steps.

In a real-life scenario, we can apply eval() when analyzing healthcare data to calculate aggregated statistics, filter out specific patient groups, or create new variables based on medical conditions.

To apply eval(), we pass the expression as a string to the eval() method, which evaluates it in the context of the DataFrame and returns the result, typically a Series for element-wise expressions.
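
Two features of eval() that the examples in this section do not use are assignment inside the expression and references to local Python variables with the @ prefix. The sketch below shows both on a hypothetical `clinics` frame; the 'Load' column name and the threshold value are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data
clinics = pd.DataFrame({
    'Patients': [200, 150, 90],
    'Doctors': [5, 3, 2],
})

# An assignment inside the expression returns a copy with the new column
clinics = clinics.eval('Load = Patients / Doctors')

# Local Python variables are referenced with the @ prefix
threshold = 42
busy = clinics[clinics.eval('Load > @threshold')]
print(busy)
```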



In [43]:
# Load the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/e/patients_hg17u.csv')
df.head()

Unnamed: 0,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Admission Date,Discharge Date,Doctor
0,1001,John,Doe,45,Male,Heart Disease,Medication,2020-05-10,2020-05-20,Dr. Smith
1,1002,Jane,Smith,32,Female,Diabetes,Insulin,2020-06-15,2020-06-25,Dr. Johnson
2,1003,Michael,Johnson,50,Male,High Blood Pressure,Exercise,2020-07-20,2020-07-30,Dr. Brown
3,1004,Sarah,Williams,28,Female,Anxiety,Counseling,2020-08-05,2020-08-15,Dr. Lee
4,1005,David,Anderson,60,Male,Arthritis,Physical Therapy,2020-09-10,2020-09-20,Dr. Wilson


In [44]:
# Use eval() to filter patients with age greater than 40
mask = df.eval('Age > 40')

# Display the filtered DataFrame
df[mask]

Unnamed: 0,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Admission Date,Discharge Date,Doctor
0,1001,John,Doe,45,Male,Heart Disease,Medication,2020-05-10,2020-05-20,Dr. Smith
2,1003,Michael,Johnson,50,Male,High Blood Pressure,Exercise,2020-07-20,2020-07-30,Dr. Brown
4,1005,David,Anderson,60,Male,Arthritis,Physical Therapy,2020-09-10,2020-09-20,Dr. Wilson
5,1006,Emily,Brown,42,Female,Migraine,Medication,2020-10-15,2020-10-25,Dr. Martinez
7,1008,Laura,Davis,48,Female,Depression,Therapy,2020-12-05,2020-12-15,Dr. Adams
8,1009,Robert,Jones,55,Male,Cancer,Chemotherapy,2021-01-10,2021-01-20,Dr. White
10,1011,William,Clark,65,Male,Stroke,Rehabilitation,2021-03-20,2021-03-30,Dr. Green
12,1013,Charles,Young,47,Male,Obesity,Diet Plan,2021-05-10,2021-05-20,Dr. Lee
14,1015,Matthew,Scott,52,Male,Chronic Pain,Physical Therapy,2021-07-20,2021-07-30,Dr. Adams
17,1018,Grace,Baker,58,Female,Alzheimer's Disease,Memory Exercises,2021-10-15,2021-10-25,Dr. Garcia


In [45]:
# Use eval() to calculate the average age of patients
average_age = df.eval('Age.mean()')

# Display the average age
print('Average Age:', average_age)

Average Age: 42.36


### <font color="green">Challenge</font>

Given the dataset of clinics available at the URL: https://afterwork.ai/ds/ch/clinics_uwcpz.csv, create a code snippet using the eval() method in Pandas to calculate the average number of patients per doctor across all clinics. You should use the 'Patients' and 'Doctors' columns from the dataset.


In [46]:
# Load the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/ch/clinics_uwcpz.csv')
df.head(2)

Unnamed: 0,Clinic ID,Clinic Name,Location,Specialty,Doctors,Nurses,Patients,Appointments,Revenue,Rating
0,101,City Health Clinic,New York,General Medicine,5,10,200,500,100000,4.5
1,102,Sunshine Family Clinic,Los Angeles,Pediatrics,3,5,150,400,80000,4.2


In [47]:
# Use eval() to calculate the number of patients per doctor for each clinic
df['patients_per_doc'] = df.eval('Patients / Doctors')

# Average the per-clinic ratios across all clinics
average_patient_per_doc = df['patients_per_doc'].mean()

# Display the average number of patients per doctor
print(f"The average number of patients per doctor is {average_patient_per_doc}")


The average number of patients per doctor is 43.346666666666664
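
"Average number of patients per doctor across all clinics" can be read in two ways, and the two readings generally give different numbers. The solution above averages the per-clinic ratios; the alternative divides total patients by total doctors. A small sketch with hypothetical figures shows the difference.

```python
import pandas as pd

# Hypothetical sample data
clinics = pd.DataFrame({
    'Patients': [200, 150],
    'Doctors': [5, 3],
})

# Reading 1: average of the per-clinic ratios (the approach used above)
mean_of_ratios = clinics.eval('Patients / Doctors').mean()

# Reading 2: total patients divided by total doctors
ratio_of_totals = clinics['Patients'].sum() / clinics['Doctors'].sum()

print(mean_of_ratios, ratio_of_totals)
```

The first reading weights every clinic equally, while the second weights every doctor equally, so the choice depends on the question being asked.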


## 8. Selecting Specific Columns with loc

We select specific columns with the loc indexer in Pandas to extract only the columns we are interested in from a DataFrame. This helps us focus on relevant data and ignore unnecessary columns. For example, in a healthcare dataset, we may want to extract columns related to patient demographics and medical history for further analysis.

By using loc to select specific columns, we can easily access and work with the required information without cluttering our analysis with irrelevant data.

To apply this concept, we write loc followed by square brackets containing a row selector and a column selector, like df.loc[:, ['column1', 'column2']]. The colon selects every row, and the list names the columns to keep, so the result retains only the specified columns for our analysis.
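
Besides a list of column names, loc accepts label slices and row conditions. The following sketch shows these forms on a hypothetical `patients` frame.

```python
import pandas as pd

# Hypothetical sample data
patients = pd.DataFrame({
    'Patient ID': [1001, 1002, 1003],
    'Age': [45, 32, 60],
    'Gender': ['Male', 'Female', 'Male'],
    'Diagnosis': ['Heart Disease', 'Influenza', 'Diabetes'],
})

# All rows, a list of columns
demographics = patients.loc[:, ['Patient ID', 'Age', 'Gender']]

# A label slice of columns; unlike positional slices, both ends are included
age_to_diagnosis = patients.loc[:, 'Age':'Diagnosis']

# Rows and columns together: patients over 40, two columns only
over_40 = patients.loc[patients['Age'] > 40, ['Patient ID', 'Diagnosis']]
print(over_40)
```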



In [48]:
# Load the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/e/patients_v0gy5.csv')
df.head()

Unnamed: 0,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment,Admission Date,Discharge Date,Doctor
0,1001,John,Doe,45,Male,Heart Disease,Medication,2020-05-15,2020-06-10,Dr. Smith
1,1002,Jane,Smith,32,Female,Influenza,Rest,2020-07-20,2020-07-25,Dr. Johnson
2,1003,Michael,Johnson,60,Male,Diabetes,Diet,2020-04-10,2020-05-05,
3,1004,Emily,Williams,28,Female,Fractured Leg,Surgery,2020-08-01,2020-08-15,Dr. Brown
4,1005,David,Brown,50,Male,High Blood Pressure,Exercise,2020-03-05,2020-03-20,Dr. Lee


In [49]:
# Select specific columns using loc
selected_columns = df.loc[:, ['Patient ID', 'First Name', 'Last Name', 'Age', 'Gender', 'Diagnosis', 'Treatment']]

# Display the selected columns
selected_columns

Unnamed: 0,Patient ID,First Name,Last Name,Age,Gender,Diagnosis,Treatment
0,1001,John,Doe,45,Male,Heart Disease,Medication
1,1002,Jane,Smith,32,Female,Influenza,Rest
2,1003,Michael,Johnson,60,Male,Diabetes,Diet
3,1004,Emily,Williams,28,Female,Fractured Leg,Surgery
4,1005,David,Brown,50,Male,High Blood Pressure,Exercise
...,...,...,...,...,...,...,...
70,1071,Michael,Foster,45,Male,Anemia,Vitamins
71,1072,Sophia,Bailey,56,Female,Acid Reflux,Diet
72,1073,James,Gray,39,Male,High Cholesterol,Exercise
73,1074,Olivia,Reed,53,Female,Allergies,Medication


### <font color="green">Challenge</font>

Given the dataset of clinics available at the following URL: https://afterwork.ai/ds/ch/clinics_xzwhl.csv, write a Python code snippet using Pandas to select only the columns 'Clinic Name', 'Specialty', 'Doctor Name', and 'Rating' from the dataset.


In [50]:
# Load the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/ch/clinics_xzwhl.csv')
df.head()

Unnamed: 0,Clinic ID,Clinic Name,Location,Specialty,Doctor Name,Doctor Title,Doctor Email,Contact Number,Opening Hours,Rating
0,101,City Clinic,Austin,General Medicine,Dr. Smith,General Practitioner,drsmith@example.com,123-456-7890,8am-5pm,4.5
1,102,Healthy Living Clinic,Dallas,Dermatology,Dr. Johnson,Dermatologist,drjohnson@example.com,234-567-8901,9am-6pm,4.8
2,103,Family Wellness Center,Houston,Pediatrics,Dr. Brown,Pediatrician,drbrown@example.com,345-678-9012,10am-7pm,4.2
3,104,Elite Dental Care,San Antonio,Dentistry,Dr. Lee,Dentist,drlee@example.com,456-789-0123,8:30am-4:30pm,4.7
4,105,Vision Care Center,Austin,Ophthalmology,Dr. Garcia,Ophthalmologist,drgarcia@example.com,567-890-1234,9:30am-5:30pm,4.6


In [51]:
# List the columns to keep
select_cols = ['Clinic Name', 'Specialty', 'Doctor Name', 'Rating']

# Select and display those columns using loc
df.loc[:, select_cols]


Unnamed: 0,Clinic Name,Specialty,Doctor Name,Rating
0,City Clinic,General Medicine,Dr. Smith,4.5
1,Healthy Living Clinic,Dermatology,Dr. Johnson,4.8
2,Family Wellness Center,Pediatrics,Dr. Brown,4.2
3,Elite Dental Care,Dentistry,Dr. Lee,4.7
4,Vision Care Center,Ophthalmology,Dr. Garcia,4.6
...,...,...,...,...
75,Heart Care Center,Cardiology,Dr. Powell,4.6
76,Allergy Relief Clinic,Allergy & Immunology,Dr. Nguyen,4.8
77,Senior Living Community,Geriatrics,Dr. Rivera,4.4
78,Hearing Aid Solutions,Audiology,Dr. Hughes,4.3


## 10. Viewing Summary Statistics of Data

By examining summary statistics, we can understand the central tendency, dispersion, and shape of our data distribution. This helps us identify any outliers, assess data quality, and make informed decisions about further data processing. For example, we can quickly check the mean, median, standard deviation, minimum, maximum, and quartiles of numerical columns.

To view summary statistics in Pandas, we use the describe() method on a DataFrame. We can apply this concept to a healthcare dataset to understand the distribution of patient age, blood pressure, or cholesterol levels, which can aid in identifying trends or anomalies in the data.
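
By default, describe() summarises numeric columns only. The sketch below, on a hypothetical `vitals` frame, shows the default call, the include='all' option that also covers text columns, and custom percentiles.

```python
import pandas as pd

# Hypothetical sample data
vitals = pd.DataFrame({
    'Age': [45, 32, 50, 28],
    'Heart Rate': [70, 65, 75, 60],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
})

# Default: numeric columns only
numeric_summary = vitals.describe()

# include='all' adds text columns, with rows such as 'unique', 'top' and 'freq'
full_summary = vitals.describe(include='all')

# Request specific percentiles instead of the default quartiles
tails = vitals['Age'].describe(percentiles=[0.1, 0.9])

print(numeric_summary)
```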




In [52]:
# Loading the dataset from the provided URL
df = pd.read_csv('https://afterwork.ai/ds/e/patients_wkp97.csv')
df.head()

Unnamed: 0,Patient ID,Age,Gender,Height,Weight,Blood Pressure,Cholesterol Level,Heart Rate,Temperature,Symptoms
1001,45,Male,175,70,120/80,Normal,70,98.6,Cough,Fever
1002,32,Female,160,55,110/70,Normal,65,98.2,Headache,Fatigue
1003,50,Male,180,85,130/85,High,75,99.0,Shortness of Breath,Chest Pain
1004,28,Female,165,60,115/75,Normal,60,98.0,Sore Throat,Runny Nose
1005,65,Female,155,75,140/90,High,80,99.5,Fatigue,Dizziness


In [53]:
# Displaying summary statistics of the dataset
summary_stats = df.describe()
summary_stats

Unnamed: 0,Patient ID,Gender,Height,Cholesterol Level,Heart Rate
count,75.0,75.0,75.0,75.0,75.0
mean,46.853333,169.6,76.52,71.253333,98.730667
std,12.924619,8.860785,9.290158,6.625286,0.46411
min,25.0,150.0,55.0,60.0,98.0
25%,36.5,163.5,68.0,66.0,98.4
50%,46.0,172.0,75.0,72.0,98.7
75%,56.5,177.0,85.0,76.0,99.1
max,74.0,182.0,92.0,82.0,99.6
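
The column labels in the output above appear to be shifted by one position. The values under 'Patient ID' range from 25 to 74 with a mean of about 46.9, which looks like ages, and the values under 'Heart Rate' range from 98.0 to 99.6, which looks like body temperature. This is what pandas produces when each data row has one more field than the header: the first field is used as the index and every label moves one column to the left. In the preview, each row ends with two symptom values under a single 'Symptoms' header, which would cause exactly that. A minimal sketch of the effect and one possible fix follows, using an inline sample copied from the preview. The names 'Symptom 1' and 'Symptom 2' are assumptions, and the real file may need different handling.

```python
import io
import pandas as pd

# Two rows copied from the preview: 10 header names, 11 fields per data row
raw = (
    "Patient ID,Age,Gender,Height,Weight,Blood Pressure,"
    "Cholesterol Level,Heart Rate,Temperature,Symptoms\n"
    "1001,45,Male,175,70,120/80,Normal,70,98.6,Cough,Fever\n"
    "1002,32,Female,160,55,110/70,Normal,65,98.2,Headache,Fatigue\n"
)

# Default read: the first field becomes the index, so 'Patient ID' holds ages
shifted = pd.read_csv(io.StringIO(raw))

# Possible fix: supply 11 column names and skip the original header row
names = [
    'Patient ID', 'Age', 'Gender', 'Height', 'Weight', 'Blood Pressure',
    'Cholesterol Level', 'Heart Rate', 'Temperature',
    'Symptom 1', 'Symptom 2',  # assumed names for the two symptom fields
]
fixed = pd.read_csv(io.StringIO(raw), names=names, header=0)

print(shifted['Patient ID'].tolist())
print(fixed['Age'].tolist())
```

With the names aligned, describe() would report Age, Height, Weight, Heart Rate, and Temperature under their own labels.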
