### LSE Data Analytics Online Career Accelerator

# Course 2: Data Analytics using Python

## 0) Assignment: Diagnostic Analysis using Python

You’ll be working with real-world data to address a problem faced by the National Health Service (NHS). The analysis requires you to use Python to explore the available data, create visualisations that reveal and communicate trends, and extract meaningful insights to inform decision-making. This Jupyter Notebook is the starting point. Document all your decisions and observations here so that they are available as inputs to the technical report and business presentation that form part of your submission.

### A note for students using this template
This Jupyter Notebook is a template you can use to complete the Course 2 assignment: Diagnostic Analysis using Python. 
Keep in mind the following points: 
- Using this template as your working document is optional; you are **not required** to use it to complete the assignment. 
- The template suggests the expected structure and workflow, and it follows the assignment activities throughout the course.
- Refer to the guidance in the Assignment Activity pages for specific details. 
- The markup and comments in this template identify the key elements you need to complete before submitting the assignment.
- Make this notebook your own by adding comments, cells, and content to reflect your analytical journey. You can add links, screenshots, or images to support your analysis, refine or clarify the questions, and change the workflow to suit your process. Important elements include:
    - code comments
    - Markdown cells with your observations, interpretation, and notes in anticipation of the technical report and business presentation.
- All code output and visualisations should be functional and visible in the submitted Jupyter Notebook. 
- If you decide to use this template for your assignment, make a copy of the notebook and save it using the assignment naming convention: **LastName_FirstName_DA201_Assignment_Notebook.ipynb**.
- Be sure to save frequent snapshots of your Jupyter Notebook to ensure that you can recover work if required.

 > ***Markdown*** Remember to change cell types to `Markdown`. You can review [Markdown basics](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax) to find out how to add formatted text, links, and images to your notebook.
 
 
 > ***Notebook state*** Remember that you will have to execute all the code in your notebook (from the start of the notebook to where you are currently working) every time you restart your Jupyter Notebook server or when working in a new session. Make sure that your notebook is in the correct state before continuing with the work for the current module.

# 

## 1) Assignment activity 1
In the first module you are encouraged to explore the data and the provided template. You should also reflect on the scenario and use case and start to document your own interpretation of the business questions and how you will go about answering them.

In [None]:
# Document your planned analytical approach.

> ***Check*** that you have adequately satisfied the expectations of the current module and that you have added code comments as well as Markdown cells documenting your analytic journey and observations to satisfy the assessment criteria.

# 

## 2) Assignment activity 2
**Basic exploration and descriptive statistics.**
- Import the three indicated data sources and perform basic exploratory analysis including obtaining descriptive statistics.
- Determine and comment on the quality, usefulness, and opportunities contained in the data sets.
- Document initial observations and findings.

Are there any comments regarding data quality or descriptive statistics worth noting for each of the data sets?
Can you comment on other features (columns) that could potentially be useful in your analysis?
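
The exploration steps described above could be sketched as follows on a small made-up DataFrame. The column names and values are illustrative assumptions, not the contents of the assignment files; the same calls can be applied to each imported data set.

```python
import numpy as np
import pandas as pd

# Made-up sample data (assumption): not taken from the assignment files.
sample = pd.DataFrame({
    'service_setting': ['General Practice', 'Other', None, 'General Practice'],
    'count_of_appointments': [120, 45, 30, np.nan],
})

# Shape, data types, and non-null counts.
print(sample.shape)
sample.info()

# Number of missing values per column.
missing = sample.isna().sum()
print(missing)

# Descriptive statistics for the numeric columns.
stats = sample.describe()
print(stats)
```

Comparing the non-null counts from `info()` with the row count is a quick cross-check on the `isna()` totals.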

### Prepare your workstation

In [None]:
# Import the necessary libraries.
import pandas as pd
import numpy as np

# Import other libraries if required. (Note that you can revisit this section in later modules.)



# Optional - Ignore warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Import and sense-check 'actual_duration.csv' as ad.
ad = pd.read_csv('actual_duration.csv')  # Assumes the file is in the working directory.


In [None]:
# View the DataFrame.


In [None]:
# Check for missing values.


In [None]:
# Review metadata and descriptive statistics.


In [None]:
# Import and sense-check 'appointments_regional.csv' as ar.
ar = pd.read_csv('appointments_regional.csv')  # Assumes the file is in the working directory.


In [None]:
# View the DataFrame.


In [None]:
# Check for missing values.


In [None]:
# Review metadata and descriptive statistics.


In [None]:
# Import and sense-check 'national_categories.xlsx' as nc.
nc = pd.read_excel('national_categories.xlsx')  # Assumes the file is in the working directory.


In [None]:
# View the DataFrame.



In [None]:
# Check for missing values.


In [None]:
# Review metadata and descriptive statistics.


### Exploration suggestions:
Make sure to supplement the list with additional questions and observations that you have identified during modules 1 and 2.

**Question 1:** How many locations are there in the data set?

In [None]:
# Determine the number of locations.


**Question 2:** What are the five locations with the highest number of records?
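
A minimal sketch of ranking locations by record count, using made-up data. The column name `location` is a hypothetical stand-in for whichever location column you choose. Note that this counts rows, which is not the same as summing the appointment counts.

```python
import pandas as pd

# Made-up records (assumption); 'location' is a hypothetical column name.
records = pd.DataFrame({
    'location': ['A', 'B', 'A', 'C', 'B', 'A', 'D', 'E', 'F', 'B', 'C'],
})

# value_counts() counts rows per location, sorted from highest to lowest;
# head(5) keeps the five largest.
top_five = records['location'].value_counts().head(5)
print(top_five)
```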



In [None]:
# Determine the top five locations based on record count.


**Question 3:** How many service settings, context types, national categories, and appointment statuses are there?
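
Counting distinct values in several columns at once might look like the sketch below. The data and the column names are assumptions made for illustration.

```python
import pandas as pd

# Made-up data (assumption); the column names are hypothetical stand-ins.
sample = pd.DataFrame({
    'service_setting': ['General Practice', 'Other', 'General Practice', 'Unmapped'],
    'context_type': ['Care Related Encounter', 'Unmapped',
                     'Care Related Encounter', 'Unmapped'],
    'appointment_status': ['Attended', 'DNA', 'Attended', 'Unknown'],
})

# nunique() on a DataFrame returns the number of distinct values per column.
distinct_counts = sample.nunique()
print(distinct_counts)

# unique() lists the values themselves for a single column.
print(sample['appointment_status'].unique())
```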

In [None]:
# Determine the number of service settings.


In [None]:
# Determine the number of context types.


In [None]:
# Determine the number of national categories.


In [None]:
# Determine the number of appointment statuses.


> ***Check*** that you have adequately satisfied the expectations of the current module and that you have added code comments as well as Markdown cells documenting your analytic journey and observations to satisfy the assessment criteria.

# 

## 3) Assignment activity 3
Continue your exploratory data analysis by answering the specific questions posed by the NHS as well as additional trends identified during data exploration. You can start by looking at the date range and the change in behaviour over time for the available data sources.

### Continue to explore the data and search for answers to more specific questions posed by the NHS.

**Question 1:** Between what dates were appointments scheduled? 
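
A minimal sketch of converting text dates and finding the date range. The day/month/year format used here is an assumption for the made-up sample; check the format in each DataFrame before converting.

```python
import pandas as pd

# Made-up dates stored as text (assumption: day/month/year strings).
sample = pd.DataFrame({'appointment_date': ['01/12/2021', '15/08/2021', '30/06/2022']})

# Convert the text column to datetime; an explicit format avoids
# day/month ambiguity.
sample['appointment_date'] = pd.to_datetime(sample['appointment_date'],
                                            format='%d/%m/%Y')

# Earliest and latest dates in the column.
first_date = sample['appointment_date'].min()
last_date = sample['appointment_date'].max()
print(f'Appointments range from {first_date.date()} to {last_date.date()}.')
```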

In [None]:
# View the first five rows of appointment_date for the ad DataFrame to determine the date format.


In [None]:
# View the first five rows of appointment_date for the nc DataFrame to determine the date format.


In [None]:
# Change the date format of ad['appointment_date'].


# View the DataFrame.


In [None]:
# Change the format of nc['appointment_date'] to datetime.


# View the DataFrame.


In [None]:
# Determine the minimum and maximum dates in the ad DataFrame.
# Use appropriate docstrings.


In [None]:
# Determine the minimum and maximum dates in the nc DataFrame.
# Use appropriate docstrings.


**Question 2:** Which service setting was the most popular for NHS North West London from 1 January to 1 June 2022?
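
One possible way to filter by location and date range before comparing service settings is sketched below. The data, the `location` column name, and its values are assumptions. "Most popular" can be read as either the number of records or the number of appointments, so the sketch computes both.

```python
import pandas as pd

# Made-up data (assumption); 'location' and its values are hypothetical.
sample = pd.DataFrame({
    'location': ['X', 'X', 'X', 'Y', 'X'],
    'appointment_date': pd.to_datetime(
        ['2022-01-10', '2022-03-05', '2021-12-20', '2022-02-01', '2022-05-30']),
    'service_setting': ['General Practice', 'Other', 'General Practice',
                        'General Practice', 'General Practice'],
    'count_of_appointments': [100, 40, 999, 500, 60],
})

# Keep rows for one location within the date window (both ends inclusive).
in_window = sample['appointment_date'].between(pd.Timestamp('2022-01-01'),
                                               pd.Timestamp('2022-06-01'))
subset = sample[(sample['location'] == 'X') & in_window]

# Number of records per service setting.
records_by_setting = subset['service_setting'].value_counts()
print(records_by_setting)

# Number of appointments per service setting, largest first.
appointments_by_setting = (subset.groupby('service_setting')['count_of_appointments']
                           .sum()
                           .sort_values(ascending=False))
print(appointments_by_setting)
```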

In [None]:
# For each of these service settings, determine the number of records available for the period and the location. 


# View the output.


**Question 3:** Which month had the highest number of appointments?
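
A minimal sketch of summing appointments per month and ranking the months, using made-up data. Deriving the month with `dt.to_period('M')` is one option among several and is an assumption here, not a requirement of the assignment.

```python
import pandas as pd

# Made-up data (assumption).
sample = pd.DataFrame({
    'appointment_date': pd.to_datetime(
        ['2021-08-02', '2021-08-20', '2021-09-01', '2021-10-15', '2021-10-16']),
    'count_of_appointments': [10, 20, 50, 5, 15],
})

# Derive a year-month label from the date.
sample['appointment_month'] = sample['appointment_date'].dt.to_period('M')

# Sum appointments per month and sort from highest to lowest.
monthly = (sample.groupby('appointment_month')['count_of_appointments']
           .sum()
           .sort_values(ascending=False))
print(monthly)

# The first index entry is the busiest month.
busiest_month = monthly.index[0]
print(busiest_month)
```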

In [None]:
# Number of appointments per month == sum of count_of_appointments by month.
# Use the groupby() and sort_values() functions.


**Question 4:** What was the total number of records per month?
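
The number of records per month differs from the number of appointments per month, because each record carries its own appointment count. The sketch below shows both side by side on made-up data.

```python
import pandas as pd

# Made-up data (assumption).
sample = pd.DataFrame({
    'appointment_month': ['2021-08', '2021-08', '2021-09', '2021-09', '2021-09'],
    'count_of_appointments': [10, 20, 50, 5, 15],
})

# Named aggregation: 'size' counts rows, 'sum' adds up the appointment counts.
per_month = sample.groupby('appointment_month').agg(
    records=('count_of_appointments', 'size'),
    appointments=('count_of_appointments', 'sum'),
)
print(per_month)
```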

In [None]:
# Total number of records per month.


In [None]:
# Your additional questions and insights.

> ***Check*** that you have adequately satisfied the expectations of the current module and that you have added code comments as well as Markdown cells documenting your analytic journey and observations to satisfy the assessment criteria.

# 

## 4) Assignment activity 4
Use visual techniques to explore and communicate patterns in the data. Note that you will likely revisit this section when preparing the final visualisations to be used in your technical report and business presentation. Make sure to document your thoughts and observations as they relate to various potential stakeholders.

The seasons are summer (June to August 2021), autumn (September to November 2021), winter (December to February 2022), and spring (March to May 2022).
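
The season definition above can be expressed as a lookup from calendar month to season. This is a sketch on made-up dates. It maps on the month number only, so a date outside the listed ranges (for example, June 2022) would also receive a label; filter on the date range first if that matters for your analysis.

```python
import pandas as pd

# Season lookup by calendar month number, following the definition above.
SEASON_BY_MONTH = {
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'autumn', 10: 'autumn', 11: 'autumn',
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
}

# Made-up dates (assumption).
sample = pd.DataFrame({
    'appointment_date': pd.to_datetime(
        ['2021-08-15', '2021-11-30', '2022-01-05', '2022-04-20']),
})

# Map each date's month number to its season.
sample['season'] = sample['appointment_date'].dt.month.map(SEASON_BY_MONTH)
print(sample)
```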

### Create visualisations and identify possible monthly and seasonal trends in the data.

In [None]:
# Import the necessary libraries.
import seaborn as sns
import matplotlib.pyplot as plt

# Set figure size.
sns.set(rc={'figure.figsize':(15, 12)})

# Set the plot style as white.
sns.set_style('white')

### Objective 1
Create three visualisations indicating the number of appointments per month for service settings, context types, and national categories.

In [None]:
# Change the data type of the appointment month to string to allow for easier plotting.


In [None]:
# Aggregate on a monthly level and determine the sum of records per month.


# View the output.


**Service settings:**

In [None]:
# Plot the appointments over the available date range, and review the service settings by month.
# Create a lineplot.


**Context types:**

In [None]:
# Create a separate data set that can be used in future weeks. 


# View the output.


In [None]:
# Plot the appointments over the available date range, and review the context types by month.
# Create a lineplot.


**National categories:**

In [None]:
# Create a separate data set that can be used in future weeks. 


# View the output.


In [None]:
# Plot the appointments over the available date range, and review the national categories by month.
# Create a lineplot.


### Objective 2
Create four visualisations indicating the number of appointments for service setting per season.

**Summer:**


In [None]:
# Create a separate data set that can be used in future weeks. 

# View the output.


In [None]:
# Visualise the subset using a lineplot.


**Autumn:**

In [None]:
# Create a separate data set that can be used in future weeks. 

# View the output.


In [None]:
# Visualise the subset using a lineplot.


**Winter:**

In [None]:
# Create a separate data set that can be used in future weeks. 

# View the output.


In [None]:
# Visualise the subset using a lineplot.


**Spring:**

In [None]:
# Create a separate data set that can be used in future weeks. 

# View the output.


In [None]:
# Visualise the subset using a lineplot.


> ***Check*** that you have adequately satisfied the expectations of the current module and that you have added code comments as well as Markdown cells documenting your analytic journey and observations to satisfy the assessment criteria.

# 

## 5) Assignment activity 5

### Analyse tweets from Twitter with hashtags related to healthcare in the UK.
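
The cells in this activity follow a hashtag-counting workflow. A minimal sketch of that workflow on made-up tweets is shown here; the column name `tweet_full_text` and the lower-casing of tags are assumptions, and the threshold is 1 rather than 10 only because the sample is tiny.

```python
import pandas as pd

# Made-up tweets (assumption); 'tweet_full_text' is a hypothetical column name.
tweets = pd.DataFrame({'tweet_full_text': [
    'Busy day at the clinic #healthcare #NHS',
    'New report out today #Healthcare',
    'No tags here',
    'Another one #nhs #health',
]})

# Collect every word that starts with '#', lower-cased so that
# variants such as #NHS and #nhs are counted together.
tags = []
for message in tweets['tweet_full_text']:
    for word in message.split():
        if word.startswith('#'):
            tags.append(word.lower())

# Count occurrences and convert the Series to a two-column DataFrame.
tag_counts = pd.Series(tags).value_counts().reset_index()
tag_counts.columns = ['word', 'count']
tag_counts['count'] = tag_counts['count'].astype(int)

# Keep only the tags that occur more than once.
frequent = tag_counts[tag_counts['count'] > 1]
print(frequent)
```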

In [None]:
# Libraries and settings needed for analysis.
import pandas as pd
import seaborn as sns

# Set the figure size.
sns.set(rc={'figure.figsize':(15, 12)})

# Set the plot style as white.
sns.set_style('white')

# Maximum column width to display.
pd.options.display.max_colwidth = 200

In [None]:
# Load the data set.


# View the DataFrame.


In [None]:
# Explore the metadata and data set.


In [None]:
# Would it be useful to only look at retweeted and favourited tweets?
# Explain your answer.


In [None]:
# Create a new DataFrame containing only the text.


# View the DataFrame.


In [None]:
# Loop through the messages, and create a list of values containing the # symbol.


In [None]:
# Display the first 30 records.


In [None]:
# Convert the series to a DataFrame in preparation for visualisation.


# Rename the columns.


In [None]:
# Fix the count datatype.


# View the result.


In [None]:
# Display records where the count is larger than 10.


In [None]:
# Create a Seaborn barplot displaying records with a count larger than 10.


In [None]:
# Create the plot.


# View the barplot.


 > ***Check*** that you have adequately satisfied the expectations of the current module and that you have added code comments as well as Markdown cells documenting your analytic journey and observations to satisfy the assessment criteria.

# 

## 6) Assignment activity 6
In the final module you will answer further questions from the NHS, along with any questions and observations you identified yourself. Where required, revisit previous sections that may provide useful insights into the questions posed in Module 6.

### Investigate the main concerns posed by the NHS. 

In [None]:
# Prepare your workstation.
# Load the appointments_regional.csv file.


# View the DataFrame.


In [None]:
# Print the min and max dates.


In [None]:
# Filter the data set to only look at data from 2021-08 onwards.


**Question 1:** Should the NHS start looking at increasing staff levels? 
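
One way to prepare the data for this question is sketched below on made-up figures: filter to the months of interest, total the appointments per month, and divide by 30 to approximate a daily value. Dividing by 30 ignores the actual number of days in each month, and the column name `utilisation` is an assumption.

```python
import pandas as pd

# Made-up data (assumption).
sample = pd.DataFrame({
    'appointment_month': ['2021-07', '2021-08', '2021-08', '2021-09'],
    'count_of_appointments': [3000, 1500, 1500, 6000],
})

# 'YYYY-MM' strings sort chronologically, so a string comparison
# filters by month.
recent = sample[sample['appointment_month'] >= '2021-08']

# Total appointments per month.
monthly = recent.groupby('appointment_month',
                         as_index=False)['count_of_appointments'].sum()

# Approximate daily average: monthly total divided by 30,
# rounded to one decimal place.
monthly['utilisation'] = (monthly['count_of_appointments'] / 30).round(1)
print(monthly)
```

The daily average becomes a utilisation measure only once it is compared with a capacity figure, which should come from the assignment brief.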

In [None]:
# Create an aggregated data set to review the different features.


# View the DataFrame.


In [None]:
# Determine the total number of appointments per month.


# Add a new column to indicate the average utilisation of services.
# Monthly aggregate / 30 to get to a daily value.


# View the DataFrame.


In [None]:
# Plot sum of count of monthly visits.
# Convert the appointment_month to string data type for ease of visualisation.


# Create a lineplot with Seaborn.


In [None]:
# Plot monthly capacity utilisation.


# Create a lineplot.


**Question 2:** How do the healthcare professional types differ over time?
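
Before plotting, it can help to reshape the data so that each professional type has its own column. The sketch below uses made-up data, and `hcp_type` is an assumed column name.

```python
import pandas as pd

# Made-up data (assumption); 'hcp_type' is a hypothetical column name.
sample = pd.DataFrame({
    'appointment_month': ['2021-08', '2021-08', '2021-09', '2021-09', '2021-09'],
    'hcp_type': ['GP', 'Other Practice staff', 'GP', 'Other Practice staff', 'GP'],
    'count_of_appointments': [100, 80, 110, 90, 10],
})

# Months as rows, professional types as columns, summed appointments as values.
by_type = sample.pivot_table(index='appointment_month', columns='hcp_type',
                             values='count_of_appointments', aggfunc='sum')
print(by_type)

# Month-on-month change per professional type.
change = by_type.diff()
print(change)
```

A wide table like this can be passed to `by_type.plot()` for a quick line chart, or kept in long form for a Seaborn lineplot with `hue`.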

In [None]:
# Create a lineplot to answer the question.


**Question 3:** Are there significant changes in whether or not visits are attended?
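
Raw counts can hide a change in attendance when the overall volume also changes, so a share per month may be easier to compare. This is a sketch on made-up data; the status labels and the `share_pct` column are assumptions.

```python
import pandas as pd

# Made-up data (assumption); the status labels are hypothetical.
sample = pd.DataFrame({
    'appointment_month': ['2021-08', '2021-08', '2021-09', '2021-09'],
    'appointment_status': ['Attended', 'DNA', 'Attended', 'DNA'],
    'count_of_appointments': [90, 10, 160, 40],
})

# Total appointments per month and status.
monthly = (sample.groupby(['appointment_month', 'appointment_status'], as_index=False)
           ['count_of_appointments'].sum())

# Divide each row by its month's total to get a percentage share.
month_total = monthly.groupby('appointment_month')['count_of_appointments'].transform('sum')
monthly['share_pct'] = (monthly['count_of_appointments'] / month_total * 100).round(1)
print(monthly)
```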

In [None]:
# Create a lineplot to answer the question.


**Question 4:** Are there changes in terms of appointment type and the busiest months?

In [None]:
# Create a lineplot to answer the question.


**Question 5:** Are there any trends in the time between booking and appointment?

In [None]:
# Create a lineplot to answer the question.


**Question 6:** How does the spread of service settings compare?
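
A minimal sketch of comparing the spread with boxplots, using made-up monthly totals. A dominant category can compress the other boxes, which is why a second plot without it can be useful; the label `General Practice` is an assumption about how that category is named.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up monthly totals (assumption).
sample = pd.DataFrame({
    'appointment_month': ['2021-08', '2021-09', '2021-10'] * 2,
    'service_setting': ['General Practice'] * 3 + ['Other'] * 3,
    'count_of_appointments': [1000, 1200, 1100, 40, 60, 50],
})

# Quartiles per service setting give the numbers behind a boxplot.
spread = sample.groupby('service_setting')['count_of_appointments'].describe()
print(spread)

# Exclude the dominant category so the remaining boxes are readable.
without_gp = sample[sample['service_setting'] != 'General Practice']

# Draw both versions side by side.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(data=sample, x='service_setting', y='count_of_appointments', ax=axes[0])
axes[0].set_title('All service settings')
sns.boxplot(data=without_gp, x='service_setting', y='count_of_appointments', ax=axes[1])
axes[1].set_title('Without General Practice')
plt.show()
```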

In [None]:
# Let's go back to the national category DataFrame you created in an earlier assignment activity.


In [None]:
# Create a new DataFrame consisting of the month of appointment and the number of appointments.

# View the DataFrame.


In [None]:
# Create a boxplot to investigate the spread of service settings.


In [None]:
# Create a boxplot to investigate the service settings without GP.


### Provide a summary of your findings and recommendations based on the analysis.

> Double-click to insert your summary.

> ***Check*** that you have adequately satisfied the expectations of the current module and that you have added code comments as well as Markdown cells documenting your analytic journey and observations to satisfy the assessment criteria.