<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Introduction to Python for Astronomers Notebook
  </h1>
</div>

Last updated: 11 Dec 2024

**This tutorial was created by Michael Cowley with inspiration from “Python for Astronomers” by Imad Pasha.**

<br>
<img src="https://mjcowley.github.io/images/python.png" alt="Python Logo" width="400"/>

Welcome to this introductory notebook designed to familiarise newcomers with the essentials of [Python](https://www.python.org/), particularly in the contexts of astronomy and astrophysics. Python is renowned for its readability, simplicity, and the extensive range of libraries it offers, such as [NumPy](https://numpy.org/) for numerical computations, [pandas](https://pandas.pydata.org/) for data manipulation, and [Matplotlib](https://matplotlib.org/) for visualisation. These tools significantly simplify the tasks of data analysis and visualisation. This guide aims to introduce you to the fundamental concepts of Python, showcase these critical libraries, and demonstrate their practical use within Google Colab.

<br>
<img src="https://mjcowley.github.io/images/colab.png" alt="Colab Logo" width="400"/>

**Google Colab**, or **Colaboratory**, presented by Google Research, is a cloud-based [Jupyter notebook](https://jupyter.org/) service that allows the execution of Python (and other languages) directly in your web browser, with no setup required. It’s a valuable resource for collaborative projects, education, or any data science projects, thanks to its ease of use and no-cost access. Colab integrates seamlessly with Google Drive, making it straightforward to store, access, and share your datasets, notebooks, and other files.

For beginners to cloud-based development environments, starting with Google Colab is an excellent introduction to the world of collaborative coding platforms. Utilising a Google account to store and manage your datasets in Google Colab is perhaps the most straightforward method for those new to the platform. If you don’t have a Google account yet, you can sign up for free [here](https://www.google.com/account/about/).

However, **postgraduate students or more advanced users** of Python are encouraged to use their own Integrated Development Environment (IDE) for more complex projects and greater flexibility in development. Examples of powerful IDEs for Python include:
- [**Visual Studio Code** (VS Code)](https://code.visualstudio.com/)
- [**PyCharm**](https://www.jetbrains.com/pycharm/)
- [**Spyder**](https://www.spyder-ide.org/)

These IDEs offer advanced features such as debugging, code linting, and version control integration, which are useful for larger or more complex coding tasks.

Finally, be sure to [watch the instructional video here](https://www.youtube.com/watch?v=inN8seMm7UI) if you need a refresher on Google Colab.


## Basic Operations

Let's begin by understanding what a *declaration* is. Python acts as a highly advanced calculator designed for complex calculations, and many Python programs are sequences of simple mathematical operations executed on data. To handle more intricate data than what you'd input into a standard calculator, Python enables the creation of variables. These variables store values for future reference. Run each of the cells in sequence below:

In [None]:
variable_1 = 2

In [None]:
variable_2 = 3

In [None]:
output_1 = variable_1 + variable_2

In [None]:
print(output_1)

I simply saved the numbers 2 and 3 into variables, choosing the basic names 'variable_1' and 'variable_2', and then calculated their sum. Notice the underscore in the variable names. **Spaces_aren't_allowed_in_variable_names**, so underscores are commonly used instead. We'll discuss variable naming best practices later on.

You might wonder, "Why bother declaring those variables and then adding them, instead of directly calculating:"

In [None]:
2+3

In this instance, you're spot on. If my goal was simply to find the sum of 2 and 3, I could have directly inputted it. Moreover, should there have been a need to store that result, I could have easily done so:

In [None]:
output_2 = 2+3

And now, I can at any time look at that:

In [None]:
print(output_2)

Or, I can use it in further calculations:

In [None]:
output_3 = output_1 + output_2

In [None]:
print(output_3)

This is a great opportunity to explore the basic mathematical operations Python offers, two of which we've already encountered. The following lists the basic mathematical operations in Python:

In [None]:
# Perform basic mathematical operations <- this is a comment, which the cell ignores! Use the "#" symbol to comment in Python.
addition = 5 + 5
subtraction = 5 - 5
multiplication = 5 * 5
division = 5 / 6 
exponentiation = 5**2
modulus = 5 % 3

In [None]:
# Print results with labels
print("Addition: ", addition)
print("Subtraction: ", subtraction)
print("Multiplication: ", multiplication)
print("Division: ", division)
print("Exponentiation: ", exponentiation)
print("Modulus: ", modulus)
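Two details of these operators are worth seeing in action: the order in which Python evaluates them, and the difference between `/` and `//`. The cell below is a minimal sketch; the variable names are chosen purely for illustration.

```python
# Multiplication binds more tightly than addition
without_parentheses = 2 + 3 * 4    # 3 * 4 is evaluated first
with_parentheses = (2 + 3) * 4     # the sum is evaluated first

# "/" always returns a float, while "//" (floor division) rounds down
true_division = 7 / 2
floor_division = 7 // 2

print("Without parentheses: ", without_parentheses)
print("With parentheses: ", with_parentheses)
print("True division: ", true_division)
print("Floor division: ", floor_division)
```

Parentheses cost nothing, so when in doubt, add them to make the intended order explicit.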

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Basic Operations
  </h1>
</div>

In the box below, create three variables that hold your age and the ages of two other people you know. Then, set a variable named "age_average" equal to the average of the three ages. Be careful with the order of operations! You can group operations, just like in [PEMDAS](https://www.mathsisfun.com/operation-order-pemdas.html) maths, using parentheses "()".

In [None]:
# Your code here

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>
</div>

## Comments
Comments are an essential aspect of programming, allowing you to annotate your code with explanations, instructions, or reminders. Comments are ignored by the Python interpreter, meaning they won't affect your code's execution. Comments are preceded by the `#` symbol, which tells Python to disregard the text that follows. Let's see an example:

In [None]:
# This is a comment
print("This is not a comment")

## Data Types in Python

Until now, we've only dealt with numeric data types, namely integers and floats. To check the data type of a variable at any given time, you can use the `type()` function.


In [3]:
x = 2
y = 3.0
print(type(x))
print(type(y))

<class 'int'>
<class 'float'>


### Casting
Python allows you to convert data types using a process called **casting**. This is particularly useful when you need to change a variable's type to perform a specific operation. For example, you can convert an integer to a float, or vice versa. Let's see how this works:

In [4]:
# Casting an integer to a float
x = 2 # This is an integer
y = float(x)
print(y)

2.0


Casting an integer to a float is straightforward. The `float()` function converts the integer `2` to a float, resulting in `2.0`. Similarly, you can convert a float to an integer by using the `int()` function:

In [5]:
# Casting a float to an integer
x = 3.0 # This is a float
y = int(x)
print(y)

3
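Casting is not always lossless. As a small sketch (the variable names and the text value are made up for illustration), note that `int()` discards the fractional part rather than rounding, and that strings containing numbers can be cast as well:

```python
# int() drops the fractional part; round() goes to the nearest integer
truncated = int(3.9)
rounded = round(3.9)

# A string that looks like a number can be cast to a float
redshift_text = "0.75"   # hypothetical value, e.g. read from a text file
redshift = float(redshift_text)

print(truncated)
print(rounded)
print(redshift + 1)
```

If you need conventional rounding, reach for `round()` rather than `int()`.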


While your data might fundamentally be represented as numbers, Python's versatility is showcased through its various data types designed for organising numbers and more. Below are Python's essential data types:

- **Integers**: Whole numbers without a fractional part.
- **Floats**: Numbers with a decimal point.
- **Booleans**: `True` or `False` values.
- **Lists**: Ordered collections of items.
- **Dictionaries**: Collections of items accessed by unique keys.
- **Strings**: Textual data enclosed in quotes.
- **Tuples**: Ordered collections like lists but immutable.

In the upcoming sections, we'll dive into these data types, except for integers and floats, which we have already discussed above.


### Booleans

Booleans can only be in one of two states: `True` or `False`. Assign a variable to `True` or `False` below, and notice how Python syntax highlights these keywords, indicating their special role.


In [None]:
# Assigning a boolean value to a variable
is_sunny = True

# Checking the value
print(is_sunny)

Booleans are incredibly useful in **conditional statements**, where we tell the code: "If a certain condition is `True`, do 'X'; else, if another condition is `True`, do 'Y'." 

Often, we're using booleans without even realising it.
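To make this concrete, here is a minimal sketch of a comparison producing a boolean, followed by a conditional statement that branches on it. The temperature value and the thresholds are arbitrary choices for illustration.

```python
temperature = 18   # hypothetical temperature in degrees Celsius

# A comparison evaluates to a boolean
is_warm = temperature > 25
print(is_warm)

# Conditional statements branch on booleans
if is_warm:
    forecast = "warm"
elif temperature > 10:
    forecast = "mild"
else:
    forecast = "cold"

print(forecast)
```

The expression `temperature > 10` is never stored in a variable, yet it still produces a boolean behind the scenes.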

### Lists

Lists are Python's most versatile containers. You can put nearly anything inside a list, including different data types or even other lists. However, the practicality of mixed-type lists can be limited — the benefit of storing a series of numbers in a list is that you can perform operations on them collectively without concern for compatibility issues. Here's how to create a list:


In [None]:
# Defining a list
my_list = [1, "a_string", True, 3.14, [2, 4, 6]]

# Printing the list
print(my_list)

The above list is somewhat chaotic – it's rare to want or need a list with such diverse contents. However, it serves to demonstrate that Python is flexible about the types of items you can include in a list. Beyond manually specifying list contents, Python offers functions that can automatically generate lists, especially useful for creating sequences with a regular pattern:


In [None]:
# Generate a list with a regular form using a list comprehension
even_numbers = [x for x in range(2, 21, 2)]

# Printing the list
print(even_numbers)

This Python snippet demonstrates the use of a list comprehension to generate a list of even numbers. Here's how it works:

- `range(2, 21, 2)` creates a sequence of numbers starting from 2 up to (but not including) 21, with a step of 2. This step ensures that only even numbers are included.
- `[x for x in range(2, 21, 2)]` iterates over each number `x` in the sequence, adding `x` to the list.
- The result is a list of even numbers from 2 to 20, which is then printed out.

List comprehensions offer a concise way to create lists, making this method both efficient and easy to read.
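A list comprehension can also transform each item or keep only the items that pass a test. The following is a short sketch of both patterns, using made-up variable names:

```python
# Apply an expression to each item: the squares of 1 to 5
squares = [x**2 for x in range(1, 6)]

# Keep only the items that pass a test: multiples of 3 below 20
multiples_of_three = [x for x in range(1, 20) if x % 3 == 0]

print(squares)
print(multiples_of_three)
```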


<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Lists
  </h1>
</div>

Generate a list below containing the odd numbers 1, 3, 5, 7, ... 99 and save it into a variable called odd_count. Then, below, print it to verify your solution.

In [None]:
# Your code here

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>
</div>

## Indexing and Slicing

We've seen that printing a list shows its entire contents. To access a single element, we use its index, remembering that Python starts counting at 0:


In [None]:
my_list[0]

We put square brackets at the end of the variable name and specify which index we want. We can also pull multiple elements at once by specifying a range of indices (the first number is inclusive, the second is exclusive):

In [None]:
my_list[0:2]

We can also specify a step value, which is the third number in the brackets. This number tells Python how far to advance between each element it pulls, so a step of 2 takes every second element:

In [None]:
my_list[0:5:2]

What if we need to access an element within a nested list [2, 4, 6]? This situation calls for **double indexing**. Additionally, this is a good opportunity to introduce **negative indexing**, a handy feature that allows counting backwards from the end of a list, simplifying access to its latter elements:


In [None]:
my_list[-1][1]

The above code snippet demonstrates how to access the number 4 within the nested list [2, 4, 6]. The negative index `-1` refers to the last element in the list, which is itself a list. The index `[1]` then extracts the second element from that list.

So, to summarise, indexing allows you to access individual elements of an iterable, such as a list or a string, using their position.

- **Positive Indexing** starts from 0 at the beginning of the iterable and increases by 1 for each subsequent element.
- **Negative Indexing** starts from -1 for the last element, -2 for the second last, and so on, making it easy to access elements from the end.

Here's a table illustrating both positive and negative indexing for the string `"Python"`:

| Index | Character |
|-------|-----------|
| 0     | P         |
| 1     | y         |
| 2     | t         |
| 3     | h         |
| 4     | o         |
| 5     | n         |
| -6    | P         |
| -5    | y         |
| -4    | t         |
| -3    | h         |
| -2    | o         |
| -1    | n         |

This table demonstrates how each character in the string can be accessed using both positive and negative indices.
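The same string can be used to try out indexing and slicing together. This is a small sketch, with `word` as an illustrative variable name:

```python
word = "Python"

first = word[0]         # first character
last = word[-1]         # last character
middle = word[1:4]      # characters at indices 1, 2 and 3
every_second = word[::2]  # every second character, starting at index 0
tail = word[-3:]        # the last three characters

print(first, last, middle, every_second, tail)
```

Leaving out the start or end of a slice means "from the beginning" or "to the end", respectively.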


<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Indexing
  </h1>
</div>

In the cell below, using our list named my_list that contains an element "a_string" at a certain position, write a Python command to output the letter 's' from the string "a_string".

In [None]:
# Your code here

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>
</div>

### Dictionaries

Dictionaries function similarly to lists, but instead of using numeric indices to access elements, we use unique "keys" to associate with each "value". Here's an example:


In [None]:
# Create a dictionary
my_dict = {"name": "Alice", "age": 30, "city": "Sydney"}

# Accessing a value by its key
print(my_dict["name"])

This Python snippet demonstrates how to create a dictionary and access its elements. Dictionaries store data in key-value pairs, allowing you to quickly access a value by specifying its corresponding key:

- We define `my_dict` with three key-value pairs, mapping "name" to "Alice", "age" to 30, and "city" to "Sydney".
- To retrieve a value, such as Alice's name, we use the syntax `my_dict["name"]`, specifying the key to obtain its associated value.

Dictionaries are designed to look up values by key rather than by position. Since Python 3.7 they do remember the order in which items were inserted, but you cannot index them by number as you would a list. This makes them well suited to data where the relationship between keys and values matters more than the sequence of items: you can look up specific information without needing to know an element's position, making dictionaries a powerful tool for organising and retrieving data based on meaningful associations rather than order.

What if we wanted to add a new key-value pair to the dictionary? This is a simple task in Python:



In [None]:
# Adding a new key-value pair
my_dict["occupation"] = "Engineer"

# Inspecting the updated dictionary
print(my_dict)

Great, but let's now modify the value of an existing key. To do this, we simply reassign the value to the key:

In [None]:
# Modifying a value
my_dict["age"] = 31

# Inspecting the updated dictionary
print(my_dict)

As you can see, Alice is now 31 years old. This demonstrates how dictionaries are mutable, allowing you to change their contents as needed. Let's try one more change and remove a key-value pair:

In [None]:
# Removing a key-value pair
del my_dict["city"]

# Inspecting the updated dictionary
print(my_dict)

Dictionary keys must be unique and of an immutable type, such as a string, a number, or a tuple; a list cannot be used as a key. The values associated with these keys, however, can be modified, added, or removed as needed. You will find dictionaries are great for storing data that requires meaningful associations between keys and values, such as user profiles, product details, or any structured data that benefits from key-based retrieval.
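A few other dictionary tools are handy in practice. The sketch below uses a made-up dictionary called `planet` to show how to test for a key, how to look up a key that may be missing, and how to list all keys and values:

```python
planet = {"name": "Mars", "moons": 2, "has_rings": False}

# Check whether a key exists before using it
has_moons_key = "moons" in planet

# .get() returns a fallback value instead of raising a KeyError
radius = planet.get("radius_km", "unknown")

# View all keys and all values as lists
keys = list(planet.keys())
values = list(planet.values())

print(has_moons_key)
print(radius)
print(keys)
print(values)
```

Using `.get()` with a fallback is one way to avoid a `KeyError` when you are unsure whether a key is present.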

### Strings

We've previously encountered strings, which allow you to incorporate text (such as words or file paths) into your code that Python wouldn't inherently understand. Strings are incredibly versatile, capable of containing any character. When you read a text data file containing various data types, the values typically arrive as strings, leaving the conversion to more specific types, like integers or floats, up to you. Importantly, strings are sequences, meaning they can be indexed character by character, similar to lists.


In [None]:
# Defining a string
my_string = "Hello, world!"

# Accessing characters in the string
first_char = my_string[0]  # 'H'
last_char = my_string[-1]  # '!'

# Printing the characters
print(f"First character: {first_char}")
print(f"Last character: {last_char}")

This snippet illustrates several key aspects of working with strings:

- We create a simple string `my_string` with the value `"Hello, world!"` to show how text is stored.
- Using our knowledge of indexing, we can extract specific characters from the string. `my_string[0]` retrieves the first character (`'H'`), while `my_string[-1]` fetches the last character (`'!'`), demonstrating that strings can be indexed just like lists.
- The `print()` function displays the first and last characters. The "f" before the string literals indicates an **f-string**, which allows for the direct insertion of expressions into string literals using curly braces `{}`. This method simplifies the process of combining text and variables/data in output.

Let's see what else we can do with strings. For instance, we can concatenate strings, which means combining them into a single string. This is achieved using the `+` operator:


In [None]:
# Concatenating strings
new_string = my_string + " from Python"
print(new_string)

Here, we've combined the original string `"Hello, world!"` with the phrase `" from Python"`, creating a new string that reads `"Hello, world! from Python"`. This operation is known as string concatenation. Let's now try to replace a portion of the string with another string:

In [None]:
# Replacing a substring
replaced_string = my_string.replace("world", "Python")
print(replaced_string)

Here, we've replaced the substring `"world"` in the original string with `"Python"`, resulting in the new string `"Hello, Python!"`. This operation is known as string replacement. Next, let's try splitting a string into a list of substrings:

In [None]:
# Splitting a string
split_string = my_string.split(",")
print(split_string)

Here, we've split the original string `"Hello, world!"` into a list of substrings, using the comma `","` as the separator. The result is a list `['Hello', ' world!']`, where the comma has been removed. This operation is known as string splitting. Finally, let's try case conversion on a string:

In [None]:
# Converting case
upper_case = my_string.upper()
lower_case = my_string.lower()
print(upper_case)
print(lower_case)

Here, we've converted the original string `"Hello, world!"` to uppercase using the `upper()` method, resulting in `"HELLO, WORLD!"`. We've also converted the string to lowercase using the `lower()` method, resulting in `"hello, world!"`. These operations are known as case conversion. 

Strings are incredibly versatile, offering a wide range of methods for manipulating and extracting data. This flexibility makes them invaluable for working with textual data, such as user input, file contents, or any text-based information you encounter in your code.
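Combining `split()` with casting is a common way to turn text from a data file into numbers. The following is a minimal sketch; the line of text is invented for illustration and does not come from a real file.

```python
# A hypothetical line of comma-separated values from a data file
line = "12.5, 13.1, 11.8"

# Split into substrings, strip surrounding spaces, and cast each to a float
magnitudes = [float(item.strip()) for item in line.split(",")]

print(magnitudes)
print(type(magnitudes[0]))
```

Once the values are floats, they can be used in calculations just like any other numbers.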

### Tuples

Tuples are similar to lists, but they are immutable, meaning their contents cannot be changed after creation. This characteristic makes them useful for storing data that shouldn't be altered, such as a set of coordinates or a date. To understand, let's first inspect our list and attempt to modify it:

In [None]:
print(my_list)

Let's modify the second element of the list:

In [None]:
my_list[1] = "new_string"
print(my_list)

Let's try the same with a tuple:

In [None]:
my_tuple = (1, "a_string", True, 3.14, [2, 4, 6])
print(my_tuple)

Note a tuple is created using parentheses `()` instead of square brackets `[]`. Now, let's try to modify the second element of the tuple:

In [None]:
my_tuple[1] = "new_string"

Notice the `TypeError` that appears when you try to modify the tuple. Because tuples are immutable, Python refuses the assignment, whereas the same operation on a list succeeded.
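Two further tuple behaviours are worth a quick look: unpacking, and the fact that immutability applies to the tuple itself rather than to any mutable items it holds. The sketch below uses made-up coordinate values.

```python
# A hypothetical sky position as (right ascension, declination) in degrees
position = (150.1, 2.2)

# Unpack the tuple into separate variables
ra, dec = position
print(f"RA: {ra}, Dec: {dec}")

# The tuple cannot be reassigned, but a list stored inside it can still change
nested = (1, [2, 4, 6])
nested[1].append(8)
print(nested)
```

If you need a collection that is truly fixed, store only immutable items (numbers, strings, other tuples) inside the tuple.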

## Functions

Functions allow you to encapsulate code into reusable blocks, making your code cleaner and more modular. You can define a function with a name and parameters, and it can return a result.

#### Defining a Function

To define a function in Python, you use the `def` keyword followed by the function name, parameters, and a colon. The indented block below the function header contains the code that the function will execute.


In [None]:
def greet(name):
    print(f"Hello, {name}!")

In this example, the function `greet()` takes one argument `name` and prints a greeting.

#### Calling a Function

Once the function is defined, you can call it with the appropriate arguments:

In [None]:
greet("Alice")

#### Functions with Return Values

You can also define functions that return values. The `return` statement sends a result back to the caller.

In [None]:
def add(a, b):
    return a + b

result = add(5, 3)
print(result)

In this example, the `add()` function takes two arguments and returns their sum.
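Functions can also give a parameter a default value, which is used whenever the caller leaves that argument out. Here is a small sketch; the function `describe()` and its parameters are hypothetical names chosen for illustration.

```python
def describe(name, kind="star"):
    """Return a short description of an astronomical object."""
    return f"{name} is a {kind}"

# The default value is used when "kind" is not supplied
print(describe("Vega"))

# A keyword argument overrides the default
print(describe("Andromeda", kind="galaxy"))
```

The text in triple quotes is a docstring, a short description that tools such as `help()` can display.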

## Indentation
Notice above how the code inside the function is indented. In Python, indentation is crucial for code structure. It is used to define blocks of code, such as the body of a function or loop. Indentation helps Python understand the structure of your code, so be sure to use consistent indentation throughout your programs.

In [1]:
if 5 > 3:
    print("5 is greater than 3")

5 is greater than 3


In this example, the code block inside the `if` statement is indented, indicating that it should be executed only if the condition `5 > 3` is `True`. If the condition is `False`, the indented code block will be skipped. Python will give you an error if you skip indentation:

In [2]:
if 5 > 3:
print("5 is greater than 3")

IndentationError: expected an indented block (1960757576.py, line 2)

The number of spaces used for indentation is up to you, but it must be consistent within each block of code. The standard practice, recommended by the PEP 8 style guide, is to use four spaces for each level of indentation. Most code editors will automatically convert tabs to spaces to ensure consistent indentation.

## Loops

Loops are used to execute a block of code repeatedly, either for a specified number of times or until a condition is met. The most common types of loops in Python are `for` loops and `while` loops.

#### For Loop

A `for` loop iterates over a sequence (like a list, string, or range) and executes the code block for each element.

In [None]:
for i in range(1, 6):
    print(i)

In this example, `range(1, 6)` generates numbers from 1 to 5, and the `for` loop prints each number.

#### While Loop

A `while` loop runs as long as a condition is `True`. The condition is checked before each iteration, and the loop continues executing until the condition is no longer satisfied.

In [None]:
count = 1
while count <= 5:
    print(count)
    count += 1
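Loops become more useful once they work on real collections. The sketch below, which uses made-up numbers, shows a `for` loop accumulating a total, the built-in `enumerate()` function supplying each item's index, and a `while` loop that stops when a threshold is passed.

```python
readings = [4, 8, 15]   # hypothetical measurements

# Accumulate a running total with a for loop
total = 0
for value in readings:
    total += value
print(total)

# enumerate() yields the index alongside each item
for index, value in enumerate(readings):
    print(index, value)

# A while loop that keeps doubling until the value reaches 100 or more
n = 1
while n < 100:
    n *= 2
print(n)
```

As a rule of thumb, use a `for` loop when you know what you are iterating over, and a `while` loop when you only know the stopping condition.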

## Errors and Exceptions

There are two distinct types of errors in Python: **syntax errors** and **exceptions**. Syntax errors occur when the code is improperly written, such as a missing colon or parentheses. These errors are detected by Python before the code is executed, preventing the program from running. Let's generate a syntax error by omitting a closing parenthesis:

In [None]:
# Syntax error
print("Hello, world!"

When you run this cell, a `SyntaxError` will be raised, indicating that a closing parenthesis is missing. This error message is Python's way of informing you that the code is improperly written and needs correction before it can be executed.

The other type of error, exceptions, occurs when the code is syntactically correct but encounters an issue during execution. These errors are detected while the code is running, causing the program to halt. Let's generate an exception by attempting to divide by zero:

In [None]:
# Attempting to divide by zero
result = 5 / 0

When you run this cell, a `ZeroDivisionError` will be raised, indicating that you can't divide by zero. This error message is Python's way of informing you that the operation you attempted is mathematically undefined. Other built-in exceptions include those listed in the table below.



| Exception                | Description                                                       |
|--------------------------|-------------------------------------------------------------------|
| ArithmeticError          | Raised when an error occurs in numeric calculations               |
| AssertionError           | Raised when an assert statement fails                             |
| AttributeError           | Raised when attribute reference or assignment fails               |
| Exception                | Base class for all built-in, non-system-exiting exceptions        |
| EOFError                 | Raised when the input() function hits an "end of file" condition (EOF) |
| FloatingPointError       | Raised when a floating point calculation fails                    |
| GeneratorExit            | Raised when a generator is closed (with the close() method)       |
| ImportError              | Raised when an imported module does not exist                     |
| IndentationError         | Raised when indentation is not correct                            |
| IndexError               | Raised when an index of a sequence does not exist                 |
| KeyError                 | Raised when a key does not exist in a dictionary                  |
| KeyboardInterrupt        | Raised when the user hits the interrupt key (normally Ctrl+C or Delete) |
| LookupError              | Base class for errors raised when a key or index is invalid (IndexError, KeyError) |
| MemoryError              | Raised when a program runs out of memory                          |
| NameError                | Raised when a variable does not exist                             |
| NotImplementedError      | Raised when an abstract method requires an inherited class to override the method |
| OSError                  | Raised when a system related operation causes an error            |
| OverflowError            | Raised when the result of a numeric calculation is too large      |
| ReferenceError           | Raised when a weak reference object does not exist                |
| RuntimeError             | Raised when an error occurs that does not belong to any specific exception |
| StopIteration            | Raised when the next() method of an iterator has no further values |
| SyntaxError              | Raised when a syntax error occurs                                 |
| TabError                 | Raised when indentation mixes tabs and spaces inconsistently      |
| SystemError              | Raised when a system error occurs                                 |
| SystemExit               | Raised when the sys.exit() function is called                     |
| TypeError                | Raised when an operation is applied to an object of an inappropriate type |
| UnboundLocalError        | Raised when a local variable is referenced before assignment      |
| UnicodeError             | Raised when a unicode problem occurs                              |
| UnicodeEncodeError       | Raised when a unicode encoding problem occurs                     |
| UnicodeDecodeError       | Raised when a unicode decoding problem occurs                     |
| UnicodeTranslateError    | Raised when a unicode translation problem occurs                  |
| ValueError               | Raised when an argument has the right type but an inappropriate value |
| ZeroDivisionError        | Raised when the second operand (the divisor) in a division is zero |

While you are unlikely to encounter all of these errors, it's essential to understand the most common ones and how to interpret their messages. Python's error messages are designed to be informative, helping you identify the issue and its location in your code, making you a more effective programmer.

While it is outside the scope of this session, you can write programs to handle exceptions, allowing your code to continue running even when errors occur. This is a crucial aspect of programming, ensuring your code can gracefully handle unexpected issues and continue functioning as intended. Within Jupyter notebooks, these are a little easier to catch as the error will be displayed in the output of the cell.
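For the curious, here is a minimal sketch of what exception handling looks like, using Python's `try` and `except` keywords. The function name `safe_divide()` is a hypothetical choice for illustration.

```python
def safe_divide(numerator, denominator):
    """Return the quotient, or None if the denominator is zero."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # This block runs only if the division above raises the error
        print("Cannot divide by zero; returning None instead.")
        return None

print(safe_divide(5, 2))
print(safe_divide(5, 0))
```

The second call no longer halts the program: the error is caught, a message is printed, and execution carries on.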

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Integrating Basic Concepts
  </h1>
</div>

Let's tackle a comprehensive example that incorporates the elements we've discussed.

Imagine you're teaching the introductory Quantum Mechanics unit. Surprisingly, after grading the mid-semester exam, you find many scores lower than expected, despite considering the exam quite fair!

To avoid alarming students with their individual scores, you opt to calculate the exam's statistical distribution first, allowing students to see where their score stands in relation to the class average and other statistics.

The exam scores (out of 120) are as follows: 100, 68, 40, 78, 81, 65, 39, 118, 46, 78, 9, 37, 43, 87, 54, 29, 95, 87, 111, 65, 43, 53, 47, 16, 98, 82, 58, 5, 49, 67, 60, 76, 16, 111, 65, 61, 73, 63, 115, 72, 76, 48, 75, 101, 45, 46, 82, 57, 17, 88, 90, 53, 32, 28, 50, 91, 93, 7, 63, 88, 55, 37, 67, 0, 79.

Begin by placing these numbers into a list named "scores". You can copy and paste the scores directly and add the list syntax in a cell below.


In [None]:
# Your code here

Our next step is to calculate the average score. While Python has libraries and functions for streamlined calculations, mastering the manual approach is invaluable at this stage.
 
First, start by summing all values using the `sum()` function on the list. This function takes a list as an argument and returns the sum of all its elements.

Next, to calculate the average, divide that sum by the number of scores, which you can obtain by calling the `len()` function on the list.

With this method, go ahead and define a variable called "average_score" in the cell below to compute the average from the scores list.

In [None]:
# Your code here

Now that we've calculated the average score for the exam, let's convert that into a percentage. In the cell below, compute the average score's percentage by dividing it by the total points available on the test and then multiplying by 100. Execute the cell to see a sentence displaying the percentage. Examine the provided line that achieves this to understand how it functions.


Another crucial metric is the standard deviation from the mean. This statistic is especially relevant in educational contexts where grades are determined on a curve. The standard deviation provides insight into the spread of exam scores around this average, indicating the variability of students' performance.

The formula for calculating the standard deviation is:
$$
s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{X})^2}
$$

In this equation:

- $s$ is the standard deviation, indicating how scores deviate, on average, from the mean score.
- $\bar{X}$ stands for the average (mean) score.
- $N$ represents the total count of scores; the sum of squared deviations is divided by $N-1$.
- $x_i$ refers to each individual score in the dataset.
- The expression $\sum_{i=1}^{N}(x_i - \bar{X})^2$ calculates the sum of squared differences between each score and the mean, capturing the total spread of the scores around the average.

The numerator squares the deviation of each score from the mean before summing these values. The sum is then normalised by $N-1$ rather than $N$ because the scores are treated as a sample rather than an entire population; dividing by $N-1$ gives a less biased estimate of the spread.
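To make the formula concrete, the sketch below works through it step by step on a tiny made-up dataset using a plain `for` loop. The list `values` is hypothetical and is not the exam data; the loop is just one possible way to build the numerator.

```python
import math

# Hypothetical mini-dataset, not the exam scores
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Numerator: sum of squared deviations from the mean
squared_deviations = 0
for x in values:
    squared_deviations += (x - mean) ** 2

# Divide by N - 1 (sample standard deviation), then take the square root
sample_std = math.sqrt(squared_deviations / (n - 1))

print(mean, squared_deviations, round(sample_std, 3))
```

Here the mean is 5, the squared deviations sum to 32, and the sample standard deviation is roughly 2.138.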


To compute the standard deviation, the challenging part is determining the numerator of the fraction. Given the tools we've discussed so far, this can be somewhat complex. Hence, I'll introduce a new concept: NumPy arrays (NumPy is short for Numerical Python). We'll dive deeper into NumPy arrays next week. For now, take a look at the example below to grasp why they're instrumental for our calculations:


First, let's try to subtract 1 from every score in our list. This is the same kind of element-by-element operation we will need when subtracting the mean from each score.

In [None]:
print(scores-1)

Okay, so I can't subtract an integer from a list. What if I try NumPy arrays?

In [None]:
import numpy as np
arr_version = np.array(scores)
print(arr_version-1)

If you look, you should see that each of those scores is the original score with one subtracted off it. This is the power of NumPy arrays. They allow us to perform operations on entire arrays at once, which is crucial for calculating the standard deviation.
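The short sketch below shows a few more element-wise operations on a made-up array, and contrasts them with how a plain list behaves. The array `demo` is hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical values, used only to show element-wise behaviour
demo = np.array([1, 2, 3, 4])

print(demo + 10)   # adds 10 to every element
print(demo * 2)    # doubles every element
print(demo ** 2)   # squares every element

# For comparison, multiplying a plain list repeats it instead
print([1, 2, 3, 4] * 2)
```

Note the last line: with a list, `* 2` repeats the contents rather than doubling each value, which is one reason arrays are the better fit for numerical work.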

It's now your turn to calculate the numerator component. In the cell below, fill in the variable I'm calling "top_frac" to calculate this quantity:
$$
\sum_{i=1}^N (x_i - \bar{X})^2
$$

Notice here that you don't have to actually calculate it one by one - if we first compute a single array that represents each score with the mean subtracted off and then that value squared, then we finish off top_frac just by summing up that array as we've done before. Feel free to use my variable "arr_version".

In [None]:
# Your code here

With that done, we can easily apply the formula to get the final standard deviation. **Hint:** the function `np.sqrt()` will be useful here.

In [None]:
# Your code here

Great! If all steps were followed correctly, you'd discover the average score is 62/120, and the standard deviation is 28. Let's now spoil everything and I will show you how you could have done this with one line:


In [None]:
STD_scores_2 = np.std(arr_version, ddof=1)
print(STD_scores_2)

What is this `ddof` in NumPy?

- Setting `ddof=0` (the default) instructs NumPy to divide by `N`, which is suitable for calculating the population variance and standard deviation.
- Setting `ddof=1` modifies the calculation to divide by `N-1`, making it suitable for sample variance and standard deviation. 

This distinction is crucial in statistics, ensuring that the sample standard deviation accurately reflects the variability of the data. For our purposes, we're working with a sample of exam scores, so we set `ddof=1` to calculate the standard deviation correctly. We will investigate statistics further in a later practical.
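To see the effect of `ddof` directly, here is a minimal sketch on a small made-up array. The values are hypothetical and are not the exam scores.

```python
import numpy as np

# Hypothetical mini-dataset, not the exam scores
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

population_std = np.std(values)       # ddof=0 by default: divides by N
sample_std = np.std(values, ddof=1)   # divides by N - 1

print(population_std)
print(sample_std)
```

Because dividing by `N-1` uses a smaller denominator, the sample standard deviation always comes out slightly larger than the population version; the gap shrinks as the dataset grows.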


Just for a bit of fun, let's create an informative plot to visually represent the students' scores. Don't stress about understanding the plotting details for now — we'll explore this later.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def plot_errorbars(arg, **kws):
    plt.hist(scores,20,density=True)
    plt.axis('off')
    plt.show()
    f, axs = plt.subplots(2, figsize=(7, 2), sharex=True)
    sns.pointplot(x=scores, errorbar=arg, **kws, capsize=.3, ax=axs[0])
    sns.stripplot(x=scores, jitter=.3, ax=axs[1])
    
plot_errorbars("sd")

This plot is a combination of a simplified histogram, a point plot, and a strip plot. Here's what each part represents:

Histogram
- The top plot is a **Histogram** that displays the frequency distribution of your exam scores.
- Each bar represents a range of scores, and its height reflects how many students scored within that range (the histogram is normalised, so the heights are relative rather than raw counts).
- This visual helps us understand common score ranges and identify where most students fall.

Point and Strip Plots
- Below the histogram, the **Point Plot** indicates the average (mean) score with a horizontal line for the standard deviation.
- The mean score, represented by the dot, is around 62. The line extending from it shows the standard deviation, which is about 28.3 points.
- This means most scores are within 28.3 points above or below the average, giving you a sense of how spread out the scores are.
- The **Strip Plot** displays each student's individual score, giving us a look at each unique score without them stacking on top of each other.

What This Means for the Students
- If their score is close to the average (62), they're in the most common score range for this exam.
- If their score is far from the mean, the standard deviation helps them understand how their score compares with the rest of the class.
- Remember, this distribution is a snapshot of this particular exam performance and each exam can have a different pattern.

What This Means for you, the Teacher
- The histogram helps you understand the distribution of scores, identifying common score ranges and outliers.
- Perhaps you set the exam too difficult here and you may need to adjust in the future!


<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>
</div>

### Importing Data

Computer-based data storage has evolved significantly from the days when nearly all information was kept in plaintext ASCII (American Standard Code for Information Interchange) files, which are essentially text tables. While data storage practices have modernised, text tables continue to be a highly efficient means of storing data due to their simplicity and ease of access. Below, we will focus on the "quantum_mechanics.grades" file, an example of a text table file. It is common for such data files to have custom extensions like `.grades` or `.students` instead of the generic `.txt` to indicate their content type.

Please ensure that the "quantum_mechanics.grades" file is in the same directory as this notebook. We will then use Pandas, a powerful Python library for data manipulation and analysis, to load this data into a DataFrame. This step will help us understand the structure and contents of the data. If you have not yet installed Pandas, or need the latest version, make sure to install or update Pandas before we begin.

First, let's check whether you already have Pandas installed and its current version. You can do this by running the following code in a Jupyter notebook cell:

In [None]:
import pandas as pd
print(pd.__version__)

If Pandas is not installed on your system, you can install it using `pip`, the Python package installer. Execute the following command in your Jupyter notebook:
```python
!pip install pandas
```

If you already have Pandas installed and need to update it to the latest version, you can do so by executing the following command in your Jupyter notebook:
```python
!pip install --upgrade pandas
```

When working with data files, it's crucial to understand that they can come in various formats, each suitable for different types of data manipulation and analysis tasks. Two common terms you'll encounter are ASCII and delimited text files:

- **ASCII (American Standard Code for Information Interchange)**: This is a character encoding standard used to represent text in computers and other communication devices. Most text files you'll encounter are based on ASCII or one of its extensions, which supports additional characters beyond the basic set.

- **Delimited Text Files**: These are text files where data elements are separated by a specific character, known as a delimiter. Common delimiters include commas (`,`) for CSV (Comma-Separated Values) files, tabs for TSV (Tab-Separated Values) files, and spaces for space-delimited files. Delimited files are easy to read and write both by humans and machines, making them widely used in data processing.

Before importing data, it's important to determine the type of file you are dealing with. Here are a few methods to help you identify the file format:

1. **File Extension Inspection**: The file extension often gives a clue about the format of the data. Common extensions include:
   - `.csv` for comma-separated values
   - `.tsv` for tab-separated values
   - `.txt` for plain text files
   - Custom extensions like `.grades`, `.students`, etc., which might need further inspection.

2. **Opening the File in a Text Editor**: Open the file in a text editor (e.g., Notepad, Sublime Text, VS Code) and look at the first few lines to identify patterns:
   - **Comma-Separated Values (CSV)**: Look for commas separating values.
   - **Tab-Separated Values (TSV)**: Look for tabs separating values.
   - **Space-Delimited**: Look for spaces separating values.
   - **ASCII or Plain Text**: Look for simple text without any special delimiters.

3. **Using Python's Built-in Functions**: Read the first few lines of the file using Python to understand its format. For example:
   ```python
   def inspect_file(file_name):
       with open(file_name, 'r') as file:
           for i in range(5):  # Read the first 5 lines
               line = file.readline().strip()
               print(line)
   ```

Let's inspect our file to understand its structure and contents before importing it into a DataFrame. We will use the above `inspect_file()` function to read the first few lines of the "quantum_mechanics.grades" file and identify its format.

In [None]:
# Function to inspect the first few lines of a file
def inspect_file(file_name):
    with open(file_name, 'r') as file:
        for i in range(5):  # Read the first 5 lines
            line = file.readline().strip()
            print(line)

# Assign 'quantum_mechanics.grades' to the file_name variable
file_name = 'quantum_mechanics.grades'

# Run the inspect_file function on the file
inspect_file(file_name)

It appears our file is a simple text table with a student name and a grade in each row. The values are likely separated by spaces or tabs, as we can see multiple values in each line. We will now import this data into a Pandas DataFrame to explore it further. We can use the `read_csv()` function from Pandas to load the data from the file into a DataFrame. This function is versatile and can handle various file formats, including CSV, TSV, and custom delimited files.

For our data, we will assign the separator as `\s+` to handle multiple spaces as a delimiter. The `\s+` separator is a regular expression pattern that matches one or more whitespace characters, which includes spaces and tabs. This ensures that any number of spaces between columns is correctly interpreted as a delimiter.
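To see what the `\s+` pattern does on its own, the sketch below applies it to a single made-up line using Python's built-in `re` module. The line content is hypothetical and is not taken from the grades file.

```python
import re

# Hypothetical line with uneven spacing and a tab between the two fields
line = "Alice   \t 87\n"

# A raw string (r'...') keeps the backslash in \s+ intact
fields = re.split(r'\s+', line.strip())
print(fields)
```

The run of spaces and the tab are treated as a single separator, so the line splits cleanly into two fields.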

Here are some common separators you might encounter:

- `,` for comma-separated values (CSV)
- `\t` for tab-separated values (TSV)
- `;` for semicolon-separated values (often used in European data sets)
- `|` for pipe-separated values

To load the "quantum_mechanics.grades" file into a DataFrame, we will use Pandas' `read_csv` function with the following parameters:

- **`sep`**: Specifies the character or pattern that separates values in our data file. For this dataset, the columns are separated by whitespace, so we pass the pattern `\s+`, a regular expression that matches one or more whitespace characters (spaces or tabs).
- **`header`**: This parameter is set to `None`, indicating that the first line in the file is not a header row. This is the usual choice for data files that do not include column names. If the data file does contain a header row, you can set `header=0` to use the first row as column headers.
- **`names`**: Allows us to assign names to the columns in our DataFrame. Since our data consists of two columns, the student names and their grades, we will name them 'Name' and 'Grades'. If the file already contains column headers and `header=0` is used, this parameter can be omitted.
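The difference between `header=None` and `header=0` is easiest to see side by side. The sketch below reads a small made-up table from memory (via `io.StringIO`) rather than from disk; the two rows of text are hypothetical and are not the contents of the grades file.

```python
import io
import pandas as pd

# Hypothetical two-line table held in memory instead of a file on disk
raw_text = "Alice 87\nBob 92\n"

# header=None: treat every line as data and supply our own column names
no_header_df = pd.read_csv(io.StringIO(raw_text), sep=r'\s+',
                           header=None, names=['Name', 'Grades'])

# header=0: the first line is consumed as the column labels instead
first_row_header_df = pd.read_csv(io.StringIO(raw_text), sep=r'\s+', header=0)

print(no_header_df)
print(first_row_header_df)
```

With `header=0` the first student's row is swallowed as column labels, leaving one data row. This is why `header=None` matters for files without a header line.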


In [None]:
# The raw string r'\s+' keeps the backslash intact and avoids an invalid-escape warning
grades_df = pd.read_csv('quantum_mechanics.grades', sep=r'\s+', header=None, names=['Name', 'Grades'])

Let's take a look at the first few rows of the dataset to ensure it's loaded correctly. We can use the `head()` method to display the first five rows of the DataFrame. This method is useful for quickly examining the structure and contents of the data.


In [None]:
print(grades_df.head())

You may recall from the first practical, we spent some time determining the average grade and standard deviation for the quantum mechanics course. Now that we have the data loaded into a DataFrame, we can use Pandas to calculate these statistics more efficiently. The `describe()` method provides a summary of basic statistics for the data, including the count, mean, standard deviation, minimum, maximum, and quartiles.

In [None]:
print(grades_df.describe())

The other values here include the minimum and maximum grades, as well as the 25th, 50th, and 75th percentiles. These statistics provide a comprehensive overview of the data distribution, allowing us to understand the range of grades and how they are distributed.
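Each number reported by `describe()` can also be requested individually. The sketch below does this on a small made-up column; the DataFrame `demo_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical grades, used only for illustration
demo_df = pd.DataFrame({'Grades': [2, 4, 4, 4, 5, 5, 7, 9]})

mean_grade = demo_df['Grades'].mean()              # average
std_grade = demo_df['Grades'].std()                # sample standard deviation
median_grade = demo_df['Grades'].median()          # 50th percentile
lower_quartile = demo_df['Grades'].quantile(0.25)  # 25th percentile

print(mean_grade, std_grade, median_grade, lower_quartile)
```

One detail worth knowing: the Pandas `std()` method divides by `N-1` by default (`ddof=1`), whereas `np.std()` divides by `N` unless told otherwise.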

In addition to `describe()`, here are some other useful methods for inspecting data:

- **`info()`**: Provides a concise summary of the DataFrame, including the number of non-null values and data types of each column.
- **`shape`**: Returns a tuple representing the dimensionality of the DataFrame.
- **`columns`**: Returns the column labels of the DataFrame.
- **`dtypes`**: Returns the data types of each column.


In [None]:
# Let's inspect the DataFrame using these methods
print(grades_df.info())  # Display a concise summary of the DataFrame
print("\nDataFrame Shape:", grades_df.shape)  # Display the shape of the DataFrame
print("\nColumn Labels:", grades_df.columns)  # Display the column labels
print("\nData Types:\n", grades_df.dtypes)  # Display the data types of each column

We see there are two columns in the DataFrame: 'Name' and 'Grades'. The 'Name' column contains student names, while the 'Grades' column contains the grades for each student. The data types are `object` for 'Name' and `int64` for 'Grades', indicating that 'Name' is a text column and 'Grades' is a numeric column. We also see the shape of the DataFrame is (65, 2), meaning it has 65 rows and 2 columns.

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Importing Data and Inspection
  </h1>
</div>

In this activity, you will work with a dataset titled `carbonmonoxide-qld-2022.csv`, which includes measurements of carbon monoxide levels across Queensland in 2022. Your task is to import this dataset into a DataFrame, inspect it for an initial understanding, and prepare it for further analysis.

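First, ensure the dataset file is in the same directory as your notebook or accessible via a path. Use Pandas to load the data into a DataFrame. Given that the file is a CSV, ensure you incorporate the correct delimiter. Store the result in a variable named `co_df`, as later cells in this notebook refer to the DataFrame by that name.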


In [None]:
# Your code here

Display the first few rows of the dataset.

In [None]:
# Your code here

Display the first 15 rows of the dataset. Hint: the `head()` function allows you to specify the number of rows to display.


In [None]:
# Your code here

Print statistical summaries of the data.


In [None]:
# Your code here

Provide some code to understand the data file's structure and contents. 


In [None]:
# Your code here

If you've run the above correctly, you will have printed a concise summary of the DataFrame, the shape of the DataFrame, the column labels, and the data types of each column. Here's a breakdown of the output:

1. **Class Type**:
   - The output begins by confirming that `co_df` is indeed a DataFrame object, which is a standard data structure in Pandas used for data manipulation.

2. **Index**:
   - `RangeIndex: 8760 entries, 0 to 8759` indicates that the DataFrame uses a default integer index for row labeling, starting at 0 and ending at 8759. This index includes 8760 entries, suggesting the dataset covers 8760 periods (hourly data over a year, as there are 24*365 = 8760 hours in a non-leap year).

3. **Columns**:
   - The DataFrame contains four columns. The method lists each column along with the number of non-null entries and the data type (`Dtype`) of each column:
     - **Date**: All 8760 entries are non-null, meaning there are no missing values in this column. The data type is `object`, typically used in Pandas for text or mixed numeric and non-numeric data.
     - **Time**: Similar to the Date column, this also has 8760 non-null entries and is of type `object`.
     - **South Brisbane (South East Queensland) (ppm)**: This column has 8190 non-null entries, indicating some missing data (570 missing values). The data type is `float64`, used for floating-point numbers.
     - **Woolloongabba (South East Queensland) (ppm)**: Contains 8438 non-null entries, also suggesting missing data (322 missing values), and it is of type `float64`.

4. **Data Types Summary**:
   - The DataFrame has two data types: `float64` for numeric floating-point data and `object` for text or mixed data types. The summary shows there are two columns of each type.

5. **Memory Usage**:
   - The memory usage of this DataFrame is approximately 273.9 KB. This information is useful for understanding how much memory the DataFrame consumes, which can be important when working with large datasets or when performance is a concern.

By understanding each of these aspects, you can better assess the initial state of your data and plan any necessary preprocessing steps, such as data cleaning and data transformation. 


<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>
</div>

### Data Cleaning

Data cleaning is a critical step in the data analysis process, involving the refinement of raw data into a more usable format. This stage is essential for ensuring the accuracy and quality of insights derived from data analysis. Here are key aspects and benefits of data cleaning:

- Clean data is crucial for generating reliable and accurate analysis results. Inaccurate data can lead to faulty conclusions and decisions.
- Cleaning data helps in reducing redundancy and improving the efficiency of data storage and data processing.
- Well-organised and clean data is easier to manipulate and analyse, making the data analysis process smoother and faster.

Common data cleaning tasks include:
- Identifying and addressing missing data, which may involve imputation (filling missing values) or removal, depending on the scenario.
- Eliminating duplicate records that may skew analysis results.
- Ensuring consistency in data types and formats across the dataset, which is crucial for comparative analysis.

Before you begin cleaning data, it's important to understand the context and goals of your analysis, as these factors heavily influence how you approach data cleaning. For instance, the way we prepare data for forecasting future trends (predictive modeling) might differ from how we set it up just to describe what has happened (descriptive analysis).

In the following sections, we will apply some data cleaning techniques to our carbon monoxide data, allowing you to practice and observe their impact on data quality and analysis outcomes.

#### Handling Missing Values

Recall we used the `info()` method to identify missing values in the carbon monoxide dataset. Let's revisit this information and explore the missing values in more detail. Missing values can occur for various reasons, such as data collection errors, sensor malfunctions, or data processing issues. It's essential to identify and address missing values appropriately to ensure the accuracy and reliability of data analysis results.

In [None]:
missing_values_count = co_df.isnull().sum() # Count the number of missing values in each column
print("Missing Values per Column:")
print(missing_values_count)

The output shows the number of missing values in each column of the dataset. The `isnull()` method returns a DataFrame of the same shape as `co_df`, where each cell contains a boolean value indicating whether the corresponding cell in `co_df` is null (`True`) or not (`False`). By summing these boolean values, we can count the missing values in each column.
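The sketch below shows the boolean mask and the resulting counts on a tiny made-up DataFrame. The column names `site_a` and `site_b` are hypothetical and do not come from the carbon monoxide file.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps; column names are illustrative
demo_df = pd.DataFrame({
    'site_a': [0.1, np.nan, 0.3, np.nan],
    'site_b': [0.2, 0.2, np.nan, 0.4],
})

mask = demo_df.isnull()               # True wherever a value is missing
missing_counts = mask.sum()           # True counts as 1, False as 0
missing_percent = mask.mean() * 100   # share of missing values per column

print(mask)
print(missing_counts)
print(missing_percent)
```

Taking the mean of the mask instead of the sum gives the fraction of missing values, which is often easier to interpret than a raw count.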

To visualise the missing values more clearly, we can use a heatmap to represent the distribution of missing values across the dataset. This visualisation helps us identify patterns and clusters of missing values, making it easier to decide on the best approach for handling them. Given visualising data is covered later, do not concern yourself with the code below, but rather focus on the concept of visualising missing values.

In [None]:
import seaborn as sns # Import Seaborn for data visualisation
import matplotlib.pyplot as plt # Import Matplotlib for plotting

# Visualise missing values using a heatmap
plt.figure(figsize=(10, 6)) # Set the figure size
sns.heatmap(co_df.isnull(), cbar=False, cmap='Reds') # Create a heatmap of missing values
plt.title('Heatmap of Missing Values') # Set the plot title
plt.show() # Display the plot

The output shows a heatmap of missing values in the dataset, with missing values represented in red. The heatmap provides a visual representation of the distribution of missing values across the dataset, highlighting columns with a high concentration of missing values. This visualisation can help us identify patterns and clusters of missing values, guiding our decision on how to handle them effectively.

Let's say we want to address the missing values in the dataset by filling them with the mean value of each column. This process, known as imputation, is a common technique for handling missing data. We can use the `fillna()` method to replace missing values with the mean of each column. This method is flexible and allows us to specify the value used for filling missing entries. To start with, we need to import the NumPy library, which provides support for mathematical operations in Python. We will then define the mean value for each column and fill the missing values accordingly. Finally, we will verify that the dataset no longer contains any missing values.

In [None]:
import numpy as np # Import NumPy for numerical operations
numeric_means = co_df.select_dtypes(include=[np.number]).mean() # Calculate the mean value for each numeric column
co_df.fillna(numeric_means, inplace=True) # Fill missing values with the mean value for each column

The code above calculates the mean value for each numeric column in the dataset using the `mean()` method. It then fills the missing values in each numeric column with the corresponding mean value using the `fillna()` method. The `inplace=True` parameter ensures that the changes are applied directly to the DataFrame, updating the dataset with the imputed values.

In [None]:
print(co_df.isnull().sum())

Note that using the mean to fill missing values is just one of many imputation strategies. Depending on the context and nature of the data, you may choose other methods, such as filling with the median, mode, or a specific value based on domain knowledge. It's essential to consider the implications of each imputation strategy on the data quality and analysis outcomes.
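As a rough comparison of these strategies, the sketch below fills the same gap in three different ways. The column `reading` and its values are hypothetical, chosen so that one large value pulls the mean upwards.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the large value 9.0 pulls the mean upwards
demo_df = pd.DataFrame({'reading': [1.0, 2.0, np.nan, 3.0, 9.0]})

mean_filled = demo_df['reading'].fillna(demo_df['reading'].mean())
median_filled = demo_df['reading'].fillna(demo_df['reading'].median())
constant_filled = demo_df['reading'].fillna(0.0)

print(mean_filled.tolist())
print(median_filled.tolist())
print(constant_filled.tolist())
```

The mean fill inserts 3.75 while the median fill inserts 2.5, illustrating that the median is less sensitive to extreme values.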

Let's again visualise the missing values using a heatmap to confirm that the dataset no longer contains any missing values.


In [None]:
plt.figure(figsize=(10, 6)) # Set the figure size
sns.heatmap(co_df.isnull(), cbar=False, cmap='Reds') # Create a heatmap of missing values
plt.title('Heatmap of Missing Values') # Set the plot title
plt.show() # Display the plot

The heatmap now shows no missing values in the dataset, as all cells are white, indicating that the missing values have been successfully imputed. This visualisation confirms that the dataset is now complete, with no missing values present. Imputation is a common technique for handling missing data and ensuring the completeness and reliability of the dataset for analysis.

Another common data cleaning task is to remove rows with missing values entirely. This approach is useful when the missing values are significant in number or when imputation is not appropriate for the analysis. We can use the `dropna()` method to eliminate rows with missing values from the DataFrame. This method removes any row containing at least one missing value, effectively reducing the dataset size.

First, we have to reload the original dataset to restore the missing values before proceeding with the removal of rows with missing values.

In [None]:
co_df = pd.read_csv('carbonmonoxide-qld-2022.csv', sep=',') # Reload the original dataset

Now, we can use the `dropna()` method to remove rows with missing values from the DataFrame.

In [None]:
co_df.dropna(inplace=True)

After removing rows with missing values, we can verify that the dataset no longer contains any missing values by checking the sum of missing values per column.


In [None]:
print(co_df.isnull().sum())

Notice that the output shows zero missing values for all columns, indicating that the dataset no longer contains any missing values. Removing rows with missing values can be an effective strategy for improving data quality and analysis outcomes, especially when the missing values are significant in number or when imputation is not suitable. If we now check the dataset's information, we should see that all columns have the same number of non-null entries.

In [None]:
print(co_df.info())

However, notice that the number of entries has decreased. This reduction in the number of rows is due to the removal of rows with missing values. It's essential to consider the trade-offs between data quality and data quantity when deciding whether to remove rows with missing values. In some cases, the loss of data may outweigh the benefits of removing missing values, so it's crucial to evaluate the impact on the analysis outcomes. In astrophysics, for example, removing rows with missing values may not be an option due to the scarcity of data points!
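When losing rows is costly, `dropna()` can be made less aggressive. The sketch below compares three variants on a small made-up table; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table; column names are illustrative
demo_df = pd.DataFrame({
    'site_a': [0.1, np.nan, 0.3, np.nan],
    'site_b': [0.2, 0.2, np.nan, np.nan],
})

any_missing_dropped = demo_df.dropna()                # drop rows with any gap
all_missing_dropped = demo_df.dropna(how='all')       # drop only fully empty rows
site_a_required = demo_df.dropna(subset=['site_a'])   # only site_a must be present

print(len(demo_df), len(any_missing_dropped),
      len(all_missing_dropped), len(site_a_required))
```

Of the four rows, the default call keeps one, `how='all'` keeps three, and `subset=['site_a']` keeps two, so the choice of options directly controls how much data survives.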


#### Removing Duplicates

Another common data cleaning task is to identify and remove duplicate records from the dataset. Duplicate records can skew analysis results and lead to inaccurate conclusions. We can use the `duplicated()` method to identify duplicate rows in the DataFrame. This method returns a boolean Series indicating whether each row is a duplicate of a previous row.
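The `keep` argument controls which members of a duplicated group are flagged. The sketch below illustrates this on a made-up three-row table; the dates and times are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps with one repeated Date/Time pair
demo_df = pd.DataFrame({
    'Date': ['01/01/2022', '01/01/2022', '01/01/2022'],
    'Time': ['01:00', '02:00', '01:00'],
})

# Default keep='first': only later repeats are flagged
flag_repeats = demo_df.duplicated(subset=['Date', 'Time'])

# keep=False: every member of a duplicated group is flagged
flag_all = demo_df.duplicated(subset=['Date', 'Time'], keep=False)

print(flag_repeats.tolist())
print(flag_all.tolist())
```

With `keep=False` both copies of the repeated row are marked, which is convenient when you want to inspect every record involved in a duplication.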


In [None]:
co_df = pd.read_csv('carbonmonoxide-qld-2022.csv', sep=',') # Reload the original dataset to restore the duplicate records
duplicates = co_df.duplicated(subset=['Date', 'Time'], keep=False) # Check for duplicate records based on the 'Date' and 'Time' columns
print("Number of duplicates: ", duplicates.sum())

In this case, there are no duplicate records in the dataset. Let's instead contaminate the dataset with duplicates and then remove them. First, let's duplicate the first two rows of the dataset, add them to the end of the DataFrame, and then shuffle the DataFrame to ensure the duplicates are not just at the bottom.


In [None]:
duplicates_to_add = co_df.iloc[0:2] # Select the first two rows
co_df_with_duplicates = pd.concat([co_df, duplicates_to_add], ignore_index=True) # Add the duplicates to the end of the DataFrame
co_df_with_duplicates = co_df_with_duplicates.sample(frac=1, random_state=1).reset_index(drop=True) # Shuffle the DataFrame

Let's check the number of rows in the original dataset and the contaminated dataset to confirm that the duplicates have been added.


In [None]:
print("\nNumber of rows in the original data frame:", len(co_df))
print("\nNumber of rows in our contaminated data frame:", len(co_df_with_duplicates))

We can re-run the code from earlier to identify the number of duplicate records in the contaminated dataset.


In [None]:
duplicates = co_df_with_duplicates.duplicated(subset=['Date', 'Time'], keep=False)
print("Number of duplicates: ", duplicates.sum())

We can also display the duplicate rows to see which records are duplicated.


In [None]:
all_duplicate_rows = co_df_with_duplicates[co_df_with_duplicates.duplicated(keep=False)]
all_duplicate_rows

As you can see, the DataFrame now contains duplicate records. We can use the `drop_duplicates()` method to remove these duplicate rows from the DataFrame. This method removes duplicate rows based on the specified columns, keeping only the first occurrence of each unique row. 


In [None]:
co_df_with_duplicates.drop_duplicates(subset=['Date', 'Time'], keep='first', inplace=True)

The `drop_duplicates()` method removes duplicate rows from the DataFrame based on the specified columns. The `keep='first'` parameter ensures that only the first occurrence of each unique row is retained, while subsequent duplicates are removed. After removing the duplicate records, we can verify that the dataset no longer contains any duplicate rows by running our checks from above:


In [None]:
duplicates = co_df_with_duplicates.duplicated(subset=['Date', 'Time'], keep=False)
print("Number of duplicates: ", duplicates.sum())
all_duplicate_rows = co_df_with_duplicates[co_df_with_duplicates.duplicated(keep=False)]
all_duplicate_rows

The output confirms that the dataset no longer contains any duplicate records, as the number of duplicates is now zero. Removing duplicate records is essential for ensuring the accuracy and reliability of data analysis results, as duplicate records can skew analysis outcomes and lead to incorrect conclusions. Say we wanted to now save our cleansed dataset to a new file. We can use the `to_csv()` method to write the DataFrame to a CSV file. This method allows us to specify the file path and name for the output file.


In [None]:
co_df_with_duplicates.to_csv('carbonmonoxide-qld-2022-cleaned.csv', index=False)

Note that because the only rows we removed were the duplicates we introduced ourselves, the saved file `carbonmonoxide-qld-2022-cleaned.csv` should contain exactly the same records as our original dataset `carbonmonoxide-qld-2022.csv`, although in a different row order because we shuffled the DataFrame. In a real-world scenario, you would save the cleansed dataset to a new file to preserve the original data for reference and comparison.
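To check that a saved file can be read back as expected, a simple round trip is useful. The sketch below writes a made-up DataFrame to a temporary directory and reloads it; the data and the file name `demo.csv` are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data; the file is written to a temporary directory
demo_df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Grades': [87, 92]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.csv')
    demo_df.to_csv(path, index=False)   # index=False leaves out the row labels
    reloaded_df = pd.read_csv(path)

print(reloaded_df)
```

Without `index=False`, the row labels would be written as an extra unnamed column and would reappear as data when the file is reloaded.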


#### Standardising Data Formats

Standardising data formats is crucial for ensuring consistency and comparability across the dataset. Inconsistent data formats can lead to errors in data analysis and visualisation, making it challenging to draw meaningful insights from the data. Let's explore an example of standardising data formats by converting the 'Date' and 'Time' columns in the carbon monoxide dataset to a single datetime column.

First, let's reload the original dataset to restore the 'Date' and 'Time' columns.

In [None]:
co_df = pd.read_csv('carbonmonoxide-qld-2022.csv', sep=',') # Reload the original dataset
print(co_df[['Date', 'Time']].head()) # Display the 'Date' and 'Time' columns

The 'Date' and 'Time' columns are currently stored as separate columns in the dataset. We can combine these columns into a single datetime column by concatenating the 'Date' and 'Time' values and converting them to a datetime format. This process allows us to standardise the data format and create a unified timestamp for each record.


Step 1: Convert 'Date' to datetime type without a time component

In [None]:
co_df['Date'] = pd.to_datetime(co_df['Date'], format='%d/%m/%Y') # Convert 'Date' to datetime format

In the above, we use the `pd.to_datetime()` function to convert the 'Date' column to a datetime format. The `format='%d/%m/%Y'` parameter specifies the date format in the 'Date' column, which is 'day/month/year'. This format is common in many countries, including Australia, where the day precedes the month in date representations.

Step 2: Convert 'Time' to a timedelta, appending the seconds so that values such as '1:00' become '1:00:00' (hours:minutes:seconds)

In [None]:
co_df['Time'] = pd.to_timedelta(co_df['Time'] + ':00')

In the above, we use the `pd.to_timedelta()` function to convert the 'Time' column to a timedelta format. We append ':00' to the 'Time' values so that each one includes hours, minutes, and seconds. This step is crucial for standardising the time component and ensuring consistency in the datetime representation.


Step 3: Combine 'Date' and 'Time' into a new 'DateTime' column

In [None]:
co_df['DateTime'] = co_df['Date'] + co_df['Time']

Finally, in the above, we combine the 'Date' and 'Time' columns into a new 'DateTime' column by adding them together. This operation creates a unified datetime column that includes both the date and time components for each record. The resulting 'DateTime' column standardises the data format and provides a consistent timestamp for analysis.
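
To see the three steps end to end, here is a minimal sketch on a made-up two-row DataFrame. The `toy` name and its values are illustrative assumptions, chosen only to mirror the 'day/month/year' dates and hour-and-minute times described above.

```python
import pandas as pd

# Hypothetical toy data in the same style as the 'Date' and 'Time' columns
toy = pd.DataFrame({'Date': ['01/01/2022', '02/01/2022'],
                    'Time': ['1:00', '13:00']})

# Step 1: parse the day/month/year strings into datetimes (time defaults to midnight)
toy['Date'] = pd.to_datetime(toy['Date'], format='%d/%m/%Y')

# Step 2: append the seconds and convert the strings to timedeltas
toy['Time'] = pd.to_timedelta(toy['Time'] + ':00')

# Step 3: midnight on each date plus the time offset gives the full timestamp
toy['DateTime'] = toy['Date'] + toy['Time']

print(toy)
```

The second row is a useful check: '02/01/2022' is read as 2 January rather than 1 February, because the format string puts the day first.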

Let's now display the first few rows to verify the changes.

In [None]:
print(co_df[['Date', 'Time', 'DateTime']].head())

Since we no longer need the 'Date' and 'Time' columns, we can drop them from the dataset to avoid redundancy. We can use the `drop()` method to remove these columns from the DataFrame.

In [None]:
co_df.drop(columns=['Date', 'Time'], inplace=True)

Let's inspect the DataFrame to ensure the 'Date' and 'Time' columns have been successfully removed.

In [None]:
co_df

The output confirms that the 'Date' and 'Time' columns have been removed from the dataset, leaving only the 'DateTime' column as the unified timestamp for each record. 

While not critical, perhaps you'd like to move the 'DateTime' column to the front of the DataFrame for better visibility. We can achieve this by reordering the columns in the DataFrame.

In [None]:
co_df = co_df[['DateTime'] + [col for col in co_df.columns if col != 'DateTime']] # Move 'DateTime' column to the front

In the above, we use a list comprehension to reorder the columns in the DataFrame. The expression `[col for col in co_df.columns if col != 'DateTime']` iterates over the columns in the DataFrame and excludes the 'DateTime' column. We then concatenate the 'DateTime' column with the remaining columns to move it to the front of the DataFrame. Let's now inspect the DataFrame to confirm the column reordering.


In [None]:
co_df

The output confirms that the 'DateTime' column now sits at the front of the DataFrame. This step enhances the readability of the dataset and ensures that the primary timestamp information is easily accessible for analysis.
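
If the list comprehension feels opaque, it can help to build the new column order as a separate step and print it. The sketch below does this on a made-up one-row DataFrame; the `toy` name, its column names, and its values are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical toy data with 'DateTime' as the last column
toy = pd.DataFrame({'Site': ['X'],
                    'CO': [0.1],
                    'DateTime': pd.to_datetime(['2022-01-01 01:00'])})

# Build the new order: 'DateTime' first, then every other column in its existing order
new_order = ['DateTime'] + [col for col in toy.columns if col != 'DateTime']
print(new_order)  # ['DateTime', 'Site', 'CO']

# Selecting with a list of column names returns the columns in that order
toy = toy[new_order]
print(toy)
```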

The above is just one example of standardising data formats in a dataset. Depending on the context and requirements of your analysis, you may need to perform additional data standardisation tasks, such as converting units of measurement, normalising data ranges, or transforming categorical variables into numerical representations. These steps are essential for ensuring data consistency and comparability across the dataset, enabling more accurate and reliable analysis outcomes.
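
As a rough illustration of the other standardisation tasks mentioned above, the sketch below converts units, rescales a column to the 0 to 1 range, and maps categories to integer codes. The `toy` DataFrame, its column names, and its values are all made up for this example.

```python
import pandas as pd

# Hypothetical toy data: temperatures in Fahrenheit and a categorical site label
toy = pd.DataFrame({'temp_f': [32.0, 50.0, 212.0],
                    'site': ['A', 'B', 'A']})

# Converting units of measurement: Fahrenheit to Celsius
toy['temp_c'] = (toy['temp_f'] - 32) * 5 / 9

# Normalising a data range: min-max scaling to the interval [0, 1]
temp_min = toy['temp_c'].min()
temp_max = toy['temp_c'].max()
toy['temp_norm'] = (toy['temp_c'] - temp_min) / (temp_max - temp_min)

# Transforming a categorical variable into integer codes
toy['site_code'] = toy['site'].astype('category').cat.codes

print(toy)
```

Min-max scaling is only one of several ways to normalise a range; which one suits your analysis depends on the data and on what you plan to do with it.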

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Worked Example: Astronomical Data Transformation
  </h1>
</div>

In astronomy, accurate coordinate systems are crucial for locating objects in the sky. However, these coordinates can change over time due to the precession of Earth's axis. Therefore, astronomers often need to convert coordinates from one epoch to another to maintain accuracy in their observations and comparisons.

In this example, we will explore the process of converting astronomical coordinates from the J1950 epoch to the J2000 epoch. We will start by importing a dataset containing coordinates in the J1950 epoch, and then we will apply a transformation to update these coordinates to the J2000 standard.

This transformation is essential for ensuring that the coordinates remain relevant and accurate for current astronomical research and applications.

#### 1. Importing and Inspecting the Data

Let's start by importing the data and taking a look at the first few rows.

In [None]:
import pandas as pd # Importing pandas library
df_J1950 = pd.read_csv('Astronomical_Coordinates_J1950.csv') # Importing the dataset
print(df_J1950.head(10)) # Displaying the first 10 rows of the dataset

The dataset contains two columns: `RA (J1950)` and `DEC (J1950)`, representing the right ascension and declination coordinates in the J1950 epoch, respectively. These coordinates are essential for locating celestial objects in the sky. Let's proceed with transforming these coordinates to the J2000 epoch.


In [None]:
from astropy.coordinates import SkyCoord # Importing the SkyCoord class from the astropy.coordinates module
import astropy.units as u # Importing the astropy.units module

The above code imports the necessary modules for performing the coordinate transformation. We will use the `SkyCoord` class from the `astropy.coordinates` module to represent the astronomical coordinates. The `astropy.units` module will help us define the units for the coordinates.


In [None]:
coords_J1950 = SkyCoord(ra=df_J1950['RA (J1950)'], dec=df_J1950['DEC (J1950)'],
                        unit=(u.hourangle, u.deg), frame='fk4', obstime='J1950') # Creating a SkyCoord object with J1950 coordinates

The above code creates a `SkyCoord` object named `coords_J1950` with the right ascension and declination coordinates from the dataset. We specify the units for the right ascension (hour angle) and declination (degrees) and set the frame to 'fk4', the reference frame traditionally paired with 1950-epoch coordinates (note that Astropy's FK4 frame uses the B1950 equinox by default). The `obstime` parameter is set to 'J1950' to specify the epoch of the coordinates.


In [None]:
coords_J2000 = coords_J1950.transform_to('fk5') # Transforming the coordinates to J2000

The `transform_to` method is used to convert the coordinates from the J1950 epoch to the J2000 epoch. The method takes the target frame 'fk5' as an argument; Astropy's FK5 frame uses the J2000 equinox by default. The transformed coordinates are stored in the `coords_J2000` object. While this example is specific to astrophysical data, the concept of transforming data from one format to another is applicable to various fields of study. It may be as simple as converting units or as complex as changing coordinate systems, as demonstrated here.


Let's print the original and transformed coordinates side by side to compare the differences.

In [None]:
# Print original and transformed coordinates side by side
for original, transformed in zip(coords_J1950, coords_J2000):
    print(f"J1950: RA {original.ra.to_string(unit=u.hour, sep=':', pad=True)}, DEC {original.dec.to_string(unit=u.deg, sep=':', alwayssign=True, pad=True)}")
    transformed_ra_str = transformed.ra.to_string(unit=u.hour, sep=':', pad=True, precision=0)
    transformed_dec_str = transformed.dec.to_string(unit=u.deg, sep=':', alwayssign=True, pad=True, precision=0)
    print(f"J2000: RA {transformed_ra_str}, DEC {transformed_dec_str}")
    print()  # Adds a blank line for better readability between entries


We can also visualise the differences between the J1950 and J2000 coordinates using a plot with vector arrows to represent the changes in the coordinates.


In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))

# Extract RA and DEC for both epochs
ra_J1950 = coords_J1950.ra.wrap_at(180*u.deg).degree
dec_J1950 = coords_J1950.dec.degree
ra_J2000 = coords_J2000.ra.wrap_at(180*u.deg).degree
dec_J2000 = coords_J2000.dec.degree

# Calculate differences for vector arrows
delta_ra = ra_J2000 - ra_J1950
delta_dec = dec_J2000 - dec_J1950

# Plot J1950 coordinates with star markers
ax.scatter(ra_J1950, dec_J1950, color='yellow', alpha=0.6, edgecolors='none', s=200, marker='*', label='J1950')
# Plot J2000 coordinates with star markers
ax.scatter(ra_J2000, dec_J2000, color='red', alpha=0.6, edgecolors='none', s=200, marker='*', label='J2000')

# Adding vector arrows
ax.quiver(ra_J1950, dec_J1950, delta_ra, delta_dec, angles='xy', scale_units='xy', scale=1, color='green', width=0.003, headwidth=8, headlength=4)

# Label the axes, then add a title and a legend
ax.set_xlabel('Right Ascension (degrees)')
ax.set_ylabel('Declination (degrees)')
ax.set_title('Comparison of Astronomical Coordinates: J1950 vs J2000')
ax.legend(loc='upper right')
plt.show()


There we have it! The plot shows the comparison between the J1950 and J2000 coordinates, with vector arrows indicating the changes in the coordinates due to the epoch transformation. This visualisation helps us understand the differences between the two epochs and how the coordinates have shifted over time. 

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Example
  </h1>

It's now your turn to apply the concepts you've learned above to the quantum mechanics grades dataset.

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Data Transformation
  </h1>

In this activity, we'll take the grades from our quantum mechanics course and transform them into a standardised format for analysis. The grades are currently out of 120, and we want to convert them to a percentage scale (0-100) to ensure consistency and comparability. First, start by importing the 'quantum_mechanics.grades' file into a DataFrame named `grades_df` (the name used by the plotting cells later in this exercise) and inspecting the data to understand its structure and contents.

In [None]:
# Your code here

Next, apply a transformation to the 'Grades' column to standardise the format. The grades are currently out of 120, so divide each grade by 120 and multiply by 100 to convert it to a percentage scale (0-100). Store the transformed values in a new 'Scaled Grades' column so that the original 'Grades' column remains available for comparison.

In [None]:
# Your code here

Now that you've transformed the grades into a standardised format, you can save the updated DataFrame to a new file for further analysis. Use the `to_csv()` method to write the DataFrame to a CSV file, ensuring that the transformed grades are preserved for future reference and analysis. We can also visually inspect the distribution of the original and transformed grades using a histogram to observe the changes in the data.


In [None]:
grades_df['Grades'].plot(kind='hist')

In [None]:
grades_df['Scaled Grades'].plot(kind='hist', color='orange')

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>

## Basic Plotting

We will now delve into the fundamentals of data visualisation, which helps convey complex insights in a clear and accessible way, making it easier to communicate findings to a broad audience. You'll learn how to create and interpret various types of visualisations to highlight patterns, trends, and relationships in your data.

**Note:** Make sure you have the latest versions of `matplotlib` and `seaborn` installed. Your data files should either be uploaded to Google Colab, in the same directory as your Jupyter notebook, or accessible through a URL to avoid path issues while loading files.


In [None]:
# Import the required libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

Let's begin by looking at a variety of plots and how they can be used to represent different types of data. As always, you are encouraged to run the code snippets in your own Python environment to see the plots in action. Let's start with the basics.


#### Pie Chart
Pie charts represent data in a circular graph where the size of each slice is proportional to the quantity it represents. Typically, they are divided into 2 or 3 segments, as more segments can make it difficult to interpret the data. Let's create a pie chart using some random data.

In [None]:
# Random Data
labels = ['A', 'B', 'C']
sizes = [50, 30, 20]

Here I have provided some random data for the labels and sizes. The labels represent the categories and the sizes represent the values for each category. We can now create a pie chart using this data.


In [None]:
# I begin by creating a pie chart using the `plt.pie()` function from the `matplotlib` library.
plt.pie(sizes, labels=labels) # labels are the categories and sizes are the values for each category
plt.title('Pie Chart Example') # Add a title to the chart
plt.show() # This function displays the chart

This is a very simple example, but it lacks some important information. For example, it is difficult to determine the exact value of each segment. We can add this information by using the `autopct` parameter in the `plt.pie()` function.

In [None]:
plt.pie(sizes, labels=labels, autopct='%1.1f%%') # autopct shows the percentage on the chart
plt.title('Pie Chart Example with Percentage') # Add a title to the chart
plt.show()

So we labelled this with the `autopct` parameter. Let's break down the `'%1.1f%%'` string:
  - **%**: Marks the start of a format specifier.
  - **1.1f**:
    - **1**: The minimum total width of the formatted number, in characters. A width of 1 never adds padding, so the number simply takes as much space as it needs.
    - **.1**: Indicates that only one digit will be shown after the decimal point.
    - **f**: Stands for floating-point number format.
  - **%%**: Escapes the percentage symbol (`%`) so that it is displayed literally in the pie chart.
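
You can try the format string on its own, outside of any plot. When `autopct` is a string, Matplotlib fills it in with each wedge's percentage using Python's `%` operator, so the short sketch below applies the same operator by hand. The variable names and the alternative format strings are illustrative choices only.

```python
share = 50.0  # a wedge's percentage, as Matplotlib would pass it to the format string

one_decimal = '%1.1f%%' % share        # the format used in the pie chart above
rounded = '%1.1f%%' % 33.3333          # extra decimals are rounded away
padded = '%6.1f%%' % share             # a minimum width of 6 pads with leading spaces
no_decimals = '%1.0f%%' % share        # .0 drops the decimal part entirely

print(one_decimal)   # 50.0%
print(rounded)       # 33.3%
print(padded)        #   50.0%
print(no_decimals)   # 50%
```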





Is the plot a little small? Let's make it bigger.

In [None]:
plt.figure(figsize=(8, 8)) # Set the size of the plot
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart Example with Percentage') 
plt.show()

Now I want to rotate the pie chart. We can do this by using the `startangle` parameter in the `plt.pie()` function.


In [None]:
plt.figure(figsize=(8, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90) # startangle rotates the chart
plt.title('Pie Chart Example with Percentage')
plt.show()

We can also explode a segment of the pie chart. This can be done by using the `explode` parameter in the `plt.pie()` function.

In [None]:
explode = (0, 0.1, 0) # This will explode the second segment

plt.figure(figsize=(8, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, explode=explode) # explode the second segment
plt.title('Pie Chart Example with Percentage')
plt.show()

We can also add a legend to the pie chart. This can be done by using the `plt.legend()` function.


In [None]:
plt.figure(figsize=(8, 8))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, explode=explode)
plt.title('Pie Chart Example with Percentage')
plt.legend(loc='upper right') # Add a legend to the chart
plt.show()


This is looking nice! One final modification is to turn this into a donut chart. We can do this by using the `wedgeprops` parameter in the `plt.pie()` function.

In [None]:
plt.figure(figsize=(8, 8))  
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, explode=explode, wedgeprops=dict(width=0.3)) # Create a donut chart
plt.title('Donut Chart Example with Percentage')
plt.legend(loc='upper right')
plt.show()


All we did here was add the `wedgeprops=dict(width=0.3)` parameter to the `plt.pie()` function. This parameter specifies the width of the donut chart. A value of `0.3` will create a donut chart with a width of 30% of the radius. You can adjust this value to create a donut chart with a different width.


<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Pie Chart
  </h1>

#### Part 1: Customising Labels
1. Create a new pie chart with labels `['Red', 'Blue', 'Yellow']` and sizes `[40, 35, 25]`.
2. Use the `autopct` parameter to show percentages with one decimal place.

#### Part 2: Highlight a Different Segment
1. Create a new pie chart with labels `['Group A', 'Group B', 'Group C']` and sizes `[60, 30, 10]`.
2. Highlight the third segment of the pie chart using the `explode` parameter.

#### Part 3: Donut Chart Variations
1. Create a new donut chart using labels `['X', 'Y', 'Z']` and sizes `[45, 35, 20]`.
2. Use the `wedgeprops` parameter to set the width of the donut chart to `0.5`.
3. Add a custom colour palette of your choice.

In [None]:
# Your code here

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>

#### Bar Chart
Bar charts are used to represent data in rectangular bars with lengths proportional to the values they represent. They are highly effective for comparing data across different categories. Let's create a bar chart using our random data from above. To do this, we will use the `plt.bar()` function from the `matplotlib` library.

In [None]:
plt.figure(figsize=(8, 6))
plt.bar(labels, sizes) # Create a bar chart
plt.title('Bar Chart Example')
plt.xlabel('Categories') # Add a label to the x-axis
plt.ylabel('Values') # Add a label to the y-axis
plt.show()

Let's try rotating the plot. We can do this by using the `plt.barh()` function.

In [None]:
plt.figure(figsize=(8, 6))
plt.barh(labels, sizes) # Create a horizontal bar chart
plt.title('Horizontal Bar Chart Example')
plt.xlabel('Values')
plt.ylabel('Categories')
plt.show()

Want some different colours? We can do this by using the `color` parameter in the `plt.bar()` function.

In [None]:
plt.figure(figsize=(8, 6))
plt.bar(labels, sizes, color=['red', 'green', 'blue']) # Change the colour of the bars
plt.title('Bar Chart Example')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()

If you want it all red, you can use the `color` parameter with a single colour.

In [None]:
plt.figure(figsize=(8, 6))
plt.bar(labels, sizes, color='red') # Change the colour of the bars
plt.title('Bar Chart Example')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()

Perhaps the spacing is a little off. We can adjust this by using the `width` parameter in the `plt.bar()` function.

In [None]:
plt.figure(figsize=(8, 6))
plt.bar(labels, sizes, color='red', width=0.5) # Change the width of the bars
plt.title('Bar Chart Example')
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()

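Say we have two groups of data. We can plot these side-by-side, one on either side of zero, by using the `plt.barh()` function twice! Below, all I have done is create two groups of data, made the second group negative so that its bars extend in the opposite direction, and plotted both.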

In [None]:
categories = ['A', 'B', 'C']  # Categories for the common Y-axis
values1 = [10, 20, 30]  # Values for Group 1
values2 = [-15, -25, -35]  # Values for Group 2 (negative for opposite side)
plt.barh(categories, values1, label='Group 1')  # Plot horizontal bars for Group 1
plt.barh(categories, values2, label='Group 2')  # Plot horizontal bars for Group 2
plt.xlabel('Values') 
plt.ylabel('Categories') 
plt.title('Side-by-Side Bar Chart Example')  
plt.legend()
plt.show()

Let's now instead stack the two groups of data. Once again, we create two bars using the `plt.bar()` function, but this time we add the `bottom` parameter to the second bar. This parameter specifies the y-coordinate of the bottom of the bars. Just note that we assigned group 2 to be negative above, so we need to convert these values to positive before stacking.

In [None]:
# Convert values2 to positive for stacking
values2 = np.abs(values2)

# Create a stacked vertical bar chart
plt.bar(categories, values1, label='Group 1')  # Plot bars for Group 1
plt.bar(categories, values2, bottom=values1, label='Group 2')  # Stacked bars for Group 2 on top of Group 1
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Stacked Bar Chart Example')
plt.legend()
plt.show()
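
The same idea extends to a third group, as long as each new layer starts at the running total of the layers beneath it. The sketch below is one way to do this; the third group's values and all of the variable names are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

stack_categories = ['A', 'B', 'C']
group_a = np.array([10, 20, 30])
group_b = np.array([15, 25, 35])
group_c = np.array([5, 10, 15])   # hypothetical third group

# Each layer sits on the sum of everything below it
running_total = group_a + group_b

fig, ax = plt.subplots()
ax.bar(stack_categories, group_a, label='Group 1')
ax.bar(stack_categories, group_b, bottom=group_a, label='Group 2')
ax.bar(stack_categories, group_c, bottom=running_total, label='Group 3')
ax.set_xlabel('Categories')
ax.set_ylabel('Values')
ax.set_title('Stacked Bar Chart with Three Groups')
ax.legend()
plt.show()
```

NumPy arrays are used here so that `group_a + group_b` adds the values element by element; with plain Python lists, `+` would join the two lists end to end instead.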

What about three groups of data clustered together? We can do this by using the `plt.bar()` function three times! Below, I have created three groups of data and plotted them together.

In [None]:
categories = ['A', 'B', 'C']  # Categories for the common X-axis
values1 = [10, 20, 30]  # Values for Group 1
values2 = [15, 25, 35]  # Values for Group 2
values3 = [20, 30, 40]  # Values for Group 3
bar_width = 0.25  # Width of the bars
# I'm going to start adding some spaces between my code for better readability - consider this for your workbook too!

plt.bar(np.arange(len(categories)), values1, width=bar_width, label='Group 1')  # Plot bars for Group 1
plt.bar(np.arange(len(categories)) + bar_width, values2, width=bar_width, label='Group 2')  # Plot bars for Group 2
plt.bar(np.arange(len(categories)) + 2 * bar_width, values3, width=bar_width, label='Group 3')  # Plot bars for Group 3
# Another space for readability. Also, don't forget to adequately comment your code! ChatGPT or similar is obvious, so use your own words!

plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Clustered Bar Chart Example')
plt.xticks(np.arange(len(categories)) + bar_width, categories)  # We set + bar_width to centre the labels
plt.legend()
plt.show()
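
Writing one `plt.bar()` call per group gets repetitive as the number of groups grows. As a sketch of one alternative, the loop below computes each group's offset so that the cluster is centred on its tick. The dictionary layout and the variable names are my own choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

cluster_categories = ['A', 'B', 'C']
groups = {'Group 1': [10, 20, 30],
          'Group 2': [15, 25, 35],
          'Group 3': [20, 30, 40]}
bar_width = 0.25

positions = np.arange(len(cluster_categories))  # one tick position per category
n_groups = len(groups)
offsets = []

fig, ax = plt.subplots()
for i, (label, values) in enumerate(groups.items()):
    # Shift each group so the whole cluster is centred on the tick position
    offset = (i - (n_groups - 1) / 2) * bar_width
    offsets.append(offset)
    ax.bar(positions + offset, values, width=bar_width, label=label)

ax.set_xticks(positions)
ax.set_xticklabels(cluster_categories)
ax.set_xlabel('Categories')
ax.set_ylabel('Values')
ax.set_title('Clustered Bar Chart Using a Loop')
ax.legend()
plt.show()
```

Because the offsets are centred around zero, the tick labels stay at `positions` and need no extra `+ bar_width` shift, whatever the number of groups.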

These look great. Okay, let's try some exercises.

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Bar Chart
  </h1>

#### Part 1: Create a Horizontal Bar Chart
1. Create a new horizontal bar chart using labels `['Apple', 'Banana', 'Cherry']` and values `[15, 30, 20]`.
2. Set the bar colours to `['red', 'yellow', 'pink']`.

#### Part 2: Change Bar Colours and Add a Legend
1. Create a new grouped bar chart using labels `['G1', 'G2', 'G3']` and values `[10, 20, 15]` for the first group and `[12, 18, 25]` for the second group.
2. Assign different colours to each group and add a legend to distinguish between them.

#### Part 3: Adjust Bar Width
1. Create a new vertical bar chart with labels `['A', 'B', 'C']` and values `[40, 25, 35]`.
2. Adjust the `width` parameter to `0.3`.
3. Change the bar colours to use a single colour, such as `skyblue`.

In [None]:
# Your code here

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>

#### Line Chart
Line charts are used to represent data in a series of data points connected by straight lines. They are useful for showing trends over time. Let's create a line chart using some random data. To do this, we will use the `plt.plot()` function from the `matplotlib` library. First I need some random data. I do this by using the `np.arange()` function from the `numpy` library and the `np.random.randint()` function to generate random integers.

In [None]:
# Random Data
x = np.arange(10)  # This is an ordered array from 0 to 9
y = np.random.randint(0, 10, 10)  # We don't require ordered data for the Y-axis, so we can use random integers!

plt.plot(x, y)  # This function creates a line chart
plt.title('Line Chart Example')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
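
Because `np.random.randint()` draws fresh values every time the cell runs, your chart will look different on each run. If you would like repeatable results, one option is to seed NumPy's random number generator first. The sketch below assumes an arbitrary seed of 42; any integer works.

```python
import numpy as np

np.random.seed(42)                        # fix the starting point of the generator
first_draw = np.random.randint(0, 10, 10)

np.random.seed(42)                        # reset to the same starting point
second_draw = np.random.randint(0, 10, 10)

# Both draws contain exactly the same numbers
print(np.array_equal(first_draw, second_draw))  # True
```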

Let's add some markers to the line chart. We can do this by using the `marker` parameter in the `plt.plot()` function.

In [None]:
plt.plot(x, y, marker='o')  # Add markers to the line chart
plt.title('Line Chart Example with Markers')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

You can change the size and colour of the markers by using the `markersize` and `markerfacecolor` parameters in the `plt.plot()` function.

In [None]:
plt.plot(x, y, marker='o', markersize=10, markerfacecolor='red')  # Change the size and colour of the markers
plt.title('Line Chart Example with Markers')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

If you'd like a different colour for the line, you can use the `color` parameter in the `plt.plot()` function.

In [None]:
plt.plot(x, y, marker='o', markersize=10, markerfacecolor='red', color='green')  # Change the colour of the line
plt.title('Line Chart Example with Markers')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Let's add a grid to the line chart. We can do this by using the `plt.grid()` function.

In [None]:
plt.plot(x, y, marker='o', markersize=10, markerfacecolor='red', color='green')
plt.title('Line Chart Example with Markers and Grid')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.grid(True)  # Add a grid to the chart
plt.show()

What if we add a solid colour to the area under the line? We can do this by using the `plt.fill_between()` function.


In [None]:
plt.plot(x, y, marker='o', markersize=10, markerfacecolor='red', color='green')
plt.fill_between(x, y, color='green', alpha=0.2)  # Add a solid colour to the area under the line
plt.title('Line Chart Example with Markers and Fill')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Want a solid colour above the line? We can do this by using the `y2` parameter in the `plt.fill_between()` function. The `y2=max(y)+1` sets the upper boundary of the fill to just above the highest y value, allowing the area above the line to be filled with a different colour.

In [None]:
plt.plot(x, y, marker='o', markersize=10, markerfacecolor='red', color='green')
plt.fill_between(x, y, color='green', alpha=0.2)  # Area below the line
plt.fill_between(x, y, y2=max(y)+1, color='blue', alpha=0.2)  # Area above the line
plt.title('Line Chart with Markers and Fill Above/Below the Line')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Perhaps we want to apply the solid colour just to a region between x = 2 and 4. We can do this by using the `where` parameter in the `plt.fill_between()` function.

In [None]:
plt.plot(x, y, marker='o', markersize=10, markerfacecolor='red', color='green')
plt.fill_between(x, y, where=[2 <= i <= 4 for i in x], color='red', alpha=0.2)  # Area between x=2 and x=4
plt.title('Line Chart with Markers and Fill Between x=2 and x=4')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
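
The list comprehension passed to `where` can also be written as a NumPy boolean mask, which some people find easier to read. Here is a small sketch using a stand-in array named `x_demo` (an illustrative name) with the same values as `x`:

```python
import numpy as np

x_demo = np.arange(10)  # stand-in for the x values used in the chart above

# The list comprehension from the plotting cell
mask_list = [2 <= i <= 4 for i in x_demo]

# The equivalent boolean mask; & combines the two conditions element by element
mask_array = (x_demo >= 2) & (x_demo <= 4)

print(x_demo[mask_array])                      # [2 3 4]
print(np.array_equal(mask_list, mask_array))   # True
```

Either form can be passed to `where`; the parentheses around each comparison are required because `&` binds more tightly than `>=` and `<=`.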

Want a second line on the chart? We can do this by using the `plt.plot()` function twice! Below, I have created two lines and plotted them together.

In [None]:
y2 = np.random.randint(0, 10, 10)  # Random data for the second line

plt.plot(x, y, marker='o', color='green', label='Line 1')  # Plot the first line
plt.plot(x, y2, marker='o', color='red', label='Line 2')  # Plot the second line
plt.title('Line Chart with Two Lines')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.show()

Great! Let's try some exercises.

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Line Chart
  </h1>

#### Part 1: Create a New Line Chart with More Data Points
1. Create a line chart using `x = np.arange(15)` and `y = np.random.randint(0, 20, 15)`.
2. Set the line colour to blue and add circular markers (`'o'`) at each data point.

#### Part 2: Add a Second Line
1. Create a second set of data using `y2 = np.random.randint(0, 20, 15)`.
2. Plot both `y` and `y2` on the same line chart using different colours and line styles.
3. Add a legend to label each line.

#### Part 3: Highlight Specific Data Points
1. Create a line chart using `x = np.arange(10)` and `y = np.random.randint(0, 15, 10)`.
2. Use the `marker='D'` and `markersize=8` parameters to add diamond markers at each data point.
3. Change the marker colour to `orange`.

In [None]:
# Your code here

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>

#### Scatter Plot
Scatter plots are used to represent data as a collection of points. They are useful for visualising the relationship between two variables. Let's create a scatter plot using some random data. To do this, we will use the `plt.scatter()` function from the `matplotlib` library. First I need some random data. I do this by using the `np.random.randn()` function from the `numpy` library to generate random numbers.

In [None]:
# Random Data
x = np.random.randn(100)  # Random data for the x-axis
y = np.random.randn(100)  # Random data for the y-axis

plt.scatter(x, y)  # This function creates a scatter plot
plt.title('Scatter Plot Example')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

What about a line of best fit? We can do this by using the `np.polyfit()` function from the `numpy` library to calculate the coefficients of the line of best fit. We can then use the `np.poly1d()` function to turn those coefficients into a function that we can evaluate at any x value.

In [None]:
coefficients = np.polyfit(x, y, 1)
line = np.poly1d(coefficients)

The line `coefficients = np.polyfit(x, y, 1)` fits a linear regression (line of best fit) to the data by finding the slope and intercept, while `line = np.poly1d(coefficients)` creates a polynomial function representing that line, which can be used to plot the best fit across the data points.
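
With random data it is hard to tell whether the fit is behaving sensibly, so it can help to test the two functions on points that lie exactly on a known line. The sketch below uses made-up points on y = 2x + 1; the names ending in `_demo` are illustrative and kept separate from the `x` and `y` used in the plots.

```python
import numpy as np

x_demo = np.arange(5)          # 0, 1, 2, 3, 4
y_demo = 2 * x_demo + 1        # points that lie exactly on y = 2x + 1

# For a degree-1 fit, polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(x_demo, y_demo, 1)
line_demo = np.poly1d([slope, intercept])

print(slope, intercept)   # very close to 2 and 1, up to floating-point rounding
print(line_demo(10))      # very close to 21
```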

In [None]:
plt.scatter(x, y)
plt.plot(x, line(x), color='red')  # This is our line of best fit
plt.title('Scatter Plot with Line of Best Fit')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Let's add a second set of random data to the scatter plot. We can do this by using the `plt.scatter()` function twice! Below, I have created two sets of random data and plotted them together.


In [None]:
x2 = np.random.randn(100)  # Random data for the x-axis
y2 = np.random.randn(100)  # Random data for the y-axis

plt.scatter(x, y, color='red', label='Group 1')  # Plot the first set of data
plt.scatter(x2, y2, color='blue', label='Group 2')  # Plot the second set of data
plt.title('Scatter Plot with Two Groups')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
plt.show()

Want to change the size of the markers? We can do this by using the `s` parameter in the `plt.scatter()` function.


In [None]:
plt.scatter(x, y, s=100)  # Change the size of the markers
plt.title('Scatter Plot Example with Larger Markers')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

If there's a third variable, we can change the colour of the markers. We can do this by using the `c` parameter in the `plt.scatter()` function.


In [None]:
z = np.random.randint(50, 150, size=100)  # Random values for a third variable
plt.scatter(x, y, c=z, cmap='cool')  # Change the colour of the markers
plt.title('Scatter Plot Example with Colour')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.colorbar()  # Add a colour bar
plt.show()

Or perhaps change the size of the markers based on the third variable. We can do this by using the `s` parameter in the `plt.scatter()` function.


In [None]:
plt.scatter(x, y, s=z)  # Set the size of each marker from the third variable
plt.title('Scatter Plot Example with Size')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
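
The `s` parameter is interpreted as a marker area in points squared, so a third variable with very small or very large values may need rescaling before it makes sensible marker sizes. The sketch below shows one way to do this with a simple linear rescale. The array `z_demo` and the size range of 20 to 200 are assumptions chosen for illustration.

```python
import numpy as np

z_demo = np.array([50, 75, 100, 125, 149])  # hypothetical third variable

# Linearly rescale z_demo so the smallest value maps to s_min and the largest to s_max
s_min, s_max = 20, 200
sizes = s_min + (s_max - s_min) * (z_demo - z_demo.min()) / (z_demo.max() - z_demo.min())

print(sizes)
```

The resulting `sizes` array could then be passed to `plt.scatter()` as `s=sizes`.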

Let's try some exercises.

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Scatter Plot
  </h1>

#### Part 1: Create a Scatter Plot
1. Create a scatter plot using `x = np.random.randn(50)` and `y = np.random.randn(50)`.
2. Set the marker size to `50` and the colour to `green`.

#### Part 2: Add a Second Set of Data
1. Create a second scatter plot using `x2 = np.random.randn(50)` and `y2 = np.random.randn(50)`.
2. Plot both datasets on the same chart using different colours.
3. Add labels and a legend to distinguish between the two groups.

#### Part 3: Create a Colour-Coded Scatter Plot
1. Create a new scatter plot using `x = np.random.randn(100)`, `y = np.random.randn(100)`, and a third variable `z = np.random.randint(10, 100, 100)`.
2. Use the `c=z` parameter and set `cmap='plasma'`.
3. Add a colour bar using `plt.colorbar()` to indicate the range of values.

In [None]:
# Your code here

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>

#### Histogram
Histograms are used to represent the distribution of a continuous variable. They are useful for visualising the frequency of data points within a specific range. Let's create a histogram using some random data. To do this, we will use the `plt.hist()` function from the `matplotlib` library. First I need some random data. I do this by using the `np.random.randn()` function from the `numpy` library to generate random numbers.

In [None]:
# Random Data
data = np.random.randn(1000)  # Random data for the histogram where 1000 is the number of data points

plt.hist(data)  # This function creates a histogram
plt.title('Histogram Example')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

What about changing the number of bins to control the granularity of the histogram? We can do this by using the `bins` parameter in the `plt.hist()` function.


In [None]:
plt.hist(data, bins=20)  # Change the number of bins
plt.title('Histogram Example with 20 Bins')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
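
If you want to see the numbers behind the bars, `np.histogram()` performs the same binning without drawing anything. This is a small sketch using its own random sample (`sample` is a stand-in name, not the `data` variable above).

```python
import numpy as np

rng = np.random.default_rng(seed=1)
sample = rng.normal(size=1000)  # hypothetical stand-in for the histogram data

# counts holds the number of points per bin, edges holds the bin boundaries
counts, edges = np.histogram(sample, bins=20)

print(len(counts), len(edges))  # 20 bins are bounded by 21 edges
print(counts.sum())             # every data point falls in exactly one bin
print(edges[1] - edges[0])      # width of each (equal-width) bin
```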

Instead of bars, we can use step-type histograms. We can do this by using the `histtype` parameter in the `plt.hist()` function.

In [None]:
plt.hist(data, bins=20, histtype='step')  # Use a step-type histogram
plt.title('Histogram Example with Step')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

Another option for `histtype` is `'stepfilled'`. This is similar to `'step'`, but the area under the outline is filled.


In [None]:
plt.hist(data, bins=20, histtype='stepfilled')  # Use a stepfilled histogram
plt.title('Histogram Example with Step Filled')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

We can add a KDE (Kernel Density Estimate) to the histogram. A KDE is a smoothed version of the histogram that can provide additional insights into the data distribution. We need the Seaborn library for this. We can do this by using the `sns.histplot()` function from the `seaborn` library instead of the `plt.hist()` function. 

In [None]:
sns.histplot(data, bins=20, kde=True)  # Use seaborn's histplot for KDE
plt.title('Histogram Example with KDE')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

Let's add a second set of random data to the histogram. We can do this by using the `plt.hist()` function twice! Below, I have created two sets of random data and plotted them together.


In [None]:
data2 = np.random.randn(100)  # Random data for the second histogram

plt.hist(data, bins=20, alpha=0.5, label='Group 1')  # Plot the first set of data with an alpha value of 0.5 to make it transparent
plt.hist(data2, bins=20, alpha=0.5, label='Group 2')  # Plot the second set of data
plt.title('Histogram with Two Groups')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Notice that the bars for the second set of data are much shorter than those for the first. This is because the second set has fewer data points (100 compared with 1000). We can normalise the histograms to account for this by using the `density` parameter in the `plt.hist()` function, which rescales each histogram so that the total area of its bars equals 1. Keep in mind that you should adequately communicate this normalisation in your visualisation!

In [None]:
plt.hist(data, bins=20, alpha=0.5, label='Group 1', density=True)  # Normalise the first set of data
plt.hist(data2, bins=20, alpha=0.5, label='Group 2', density=True)  # Normalise the second set of data
plt.title('Normalised Histogram with Two Groups')
plt.xlabel('Values')
plt.ylabel('Density')
plt.legend()
plt.show()
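
To check what `density=True` does, the sketch below uses `np.histogram()` (which accepts the same `bins` and `density` arguments) on two made-up groups of different sizes, and confirms that the bar areas of each group add up to 1. The names `group1`, `group2` and `areas` are my own for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
group1 = rng.normal(size=1000)  # hypothetical large group
group2 = rng.normal(size=100)   # hypothetical small group

areas = {}
for name, values in [("Group 1", group1), ("Group 2", group2)]:
    density, edges = np.histogram(values, bins=20, density=True)
    # Area of each bar = height (density) x width (bin size)
    areas[name] = np.sum(density * np.diff(edges))

print(areas)
```

Both areas come out as 1 (up to floating-point rounding), regardless of how many points each group contains, which is why the two normalised histograms become directly comparable.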

What about a stacked histogram? We can do this by using the `stacked` parameter in the `plt.hist()` function.

In [None]:
plt.hist([data, data2], bins=20, alpha=0.5, label=['Group 1', 'Group 2'], stacked=True)  # Create a stacked histogram
plt.title('Stacked Histogram with Two Groups')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()
plt.show()

Prefer to go horizontal? We can do this by using the `orientation` parameter in the `plt.hist()` function.


In [None]:
plt.hist(data, bins=20, orientation='horizontal')  # Create a horizontal histogram
plt.title('Horizontal Histogram Example')
plt.xlabel('Frequency')
plt.ylabel('Values')
plt.show()

What about a cumulative histogram? This plot is a way to show the running total of data points as you move across the values. Instead of displaying the number of data points in each bin, a cumulative histogram shows how many data points fall *up to* a certain value. You can create a cumulative histogram by using the `cumulative=True` parameter in the `plt.hist()` function.

In [None]:
plt.hist(data, bins=20, cumulative=True)  # Create a cumulative histogram
plt.title('Cumulative Histogram Example')
plt.xlabel('Values')
plt.ylabel('Cumulative Frequency')
plt.show()
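
The running total behind a cumulative histogram can be reproduced by taking a cumulative sum of the ordinary bin counts. Here is a minimal sketch on a made-up sample (`sample` is a stand-in name).

```python
import numpy as np

rng = np.random.default_rng(seed=4)
sample = rng.normal(size=1000)  # hypothetical stand-in for the histogram data

counts, edges = np.histogram(sample, bins=20)
running_total = np.cumsum(counts)  # each entry adds the current bin to all bins before it

print(running_total[-1])                     # the last bar equals the number of data points
print(running_total / running_total[-1])     # the same curve as a fraction of the data
```

The running total never decreases from one bin to the next, and the final value always equals the size of the dataset.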


<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Histogram
  </h1>

#### Part 1: Create a Histogram with Different Bin Sizes
1. Generate random data using `data = np.random.randn(500)`.
2. Create a histogram with `30` bins.
3. Set the bar colour to `lightcoral`.

#### Part 2: Create a Cumulative Histogram
1. Generate a new set of random data using `data = np.random.randn(1000)`.
2. Create a cumulative histogram using `plt.hist(data, cumulative=True, bins=25)`.
3. Set the bar colour to `steelblue`.

#### Part 3: Create a Stacked Histogram
1. Generate two different sets of random data: `data1 = np.random.randn(500)` and `data2 = np.random.randn(500)`.
2. Plot them together using the `stacked=True` parameter.
3. Set different colours for each dataset using `color=['salmon', 'skyblue']`.

In [None]:
# Your code here

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>

#### Visualising Uncertainty and Spread

In previous sessions, we discussed how to calculate and interpret uncertainty and spread using statistical measures such as standard deviation, variance, and confidence intervals. Now, we will look at how to visualise these uncertainties in your plots to make your data communication more effective.

#### Why Include Uncertainty in Visualisations?

When working with real-world data, there’s almost always some uncertainty in measurements or estimates. Including this information in your visualisations helps convey the reliability of your findings and avoids misleading interpretations. For example:

- Error bars show how much a particular value might vary due to measurement or sampling errors.
- Shaded regions around a line indicate confidence intervals or uncertainty bands, giving a clearer picture of the range of possible outcomes.
- Boxplots and violin plots are used to represent the spread of the data, helping to visualise distributions and identify outliers.

#### How Do We Add Uncertainty to Visualisations?

Below are some common methods for visualising uncertainty and spread. We'll start with error bars in bar charts and line charts. 

In [None]:
categories = ['A', 'B', 'C']
values = [5, 7, 3] # Sample mean values
std_dev = [0.5, 0.7, 0.3]  # Standard deviation values calculated elsewhere

plt.bar(categories, values, yerr=std_dev, capsize=5, color='skyblue')
plt.title('Bar Chart with Error Bars')
plt.show()

Another option is shaded regions for line charts. We can do this by using the `plt.fill_between()` function. Below, I have created a line chart with shaded regions, which represent the uncertainty in the data. This could be either spread (e.g., standard deviation) or uncertainty around a central value.


In [None]:
x = np.linspace(0, 10, 100) # 100 points from 0 to 10
y = np.sin(x) # sine function
error = 0.1 + 0.1 * np.sqrt(x) # here I am just making up error values that grow with x

plt.plot(x, y, color='blue')
plt.fill_between(x, y - error, y + error, color='blue', alpha=0.2) # we set above and below the line
plt.title('Line Chart with Shaded Regions')
plt.show()
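
In practice, the band usually comes from the data rather than a made-up formula. The sketch below assumes a hypothetical experiment in which the same sine curve is measured 30 times with noise, and builds the band from the standard deviation across the repeats at each x value. All names here (`x_demo`, `repeats`, `mean_curve`, `spread`) are my own for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
x_demo = np.linspace(0, 10, 100)

# Hypothetical experiment: 30 noisy repeats of the same sine curve (one repeat per row)
repeats = np.sin(x_demo) + rng.normal(scale=0.2, size=(30, x_demo.size))

mean_curve = repeats.mean(axis=0)      # central line: mean across repeats at each x
spread = repeats.std(axis=0, ddof=1)   # sample SD across repeats at each x

lower = mean_curve - spread
upper = mean_curve + spread

print(mean_curve.shape, lower.shape, upper.shape)
```

The arrays `lower` and `upper` can then be passed to `plt.fill_between(x_demo, lower, upper, alpha=0.2)`, with `plt.plot(x_demo, mean_curve)` drawing the central line.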

We can also use boxplots to visualise the spread of data. Below, I have created a boxplot to show the spread of data for different categories. I will start by creating some random data and saving it in a DataFrame. I will then use the `sns.boxplot()` function from the `seaborn` library to create the boxplot. This automatically calculates the quartiles and the interquartile range (IQR) for each category.

In [None]:
data = pd.DataFrame({
    'Category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Values': [5, 7, 6, 8, 7, 6, 4, 3, 5]
})

sns.boxplot(x='Category', y='Values', data=data)
plt.title('Boxplot of Categories')
plt.show()

The boxplot components:
- The line inside the box represents the median.
- The box represents the interquartile range (IQR), which contains the middle 50% of the data.
- The whiskers extend to the smallest and largest values that lie within 1.5 times the IQR of the box edges (the first and third quartiles).
- Any points outside the whiskers are considered outliers.
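
These components can be calculated by hand, which is a useful check on what the plot is showing. The sketch below uses a small made-up sample and the common 1.5 × IQR rule. Note that `np.percentile()` uses linear interpolation by default, so quartiles can differ slightly between tools for some datasets.

```python
import numpy as np

values_demo = np.array([3, 4, 5, 5, 6, 6, 7, 8, 15])  # hypothetical sample

q1, median, q3 = np.percentile(values_demo, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach from the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the fences
inside = values_demo[(values_demo >= lower_fence) & (values_demo <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point
outliers = values_demo[(values_demo < lower_fence) | (values_demo > upper_fence)]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

For this sample the box runs from 5 to 7, the whiskers reach 3 and 8, and the value 15 is flagged as an outlier.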


Let's create some error bars on scatter plots. We can do this by using the `plt.errorbar()` function. Below, I have created a scatter plot with error bars to show the uncertainty or spread in the data.


In [None]:
x = np.arange(10)  # 10 data points
y = np.random.randn(10)  # Random y-values
error = np.random.rand(10) * 0.5  # Random error values

plt.errorbar(x, y, yerr=error, fmt='o', color='green', ecolor='red', capsize=5)  # The yerr parameter sets the error values
plt.title('Scatter Plot with Error Bars')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

You can change the width of the caps at the ends of the error bars by using the `capsize` parameter in the `plt.errorbar()` function. Below, I have increased the cap size to make the caps more visible.

In [None]:
plt.errorbar(x, y, yerr=error, fmt='o', color='green', ecolor='red', capsize=10)  # Increase the width of the caps
plt.title('Scatter Plot with Error Bars')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

Let's restyle the error bars so that they are less dominant. Below, I keep the markers (`fmt='o'`) but draw them in black, set `ecolor` to a light grey, thicken the error bar lines with `elinewidth`, and set `capsize` to 0 to remove the caps.


In [None]:
plt.errorbar(x, y, yerr=error, fmt='o', color='black', ecolor='lightgray', elinewidth=3, capsize=0)
plt.title('Scatter Plot with Error Bars')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()

In the previous examples, I used either made-up errors or the **standard deviation (SD)**. Recall that SD is a measure of **spread**: it tells us how much individual data points vary around the mean. This is useful for understanding variability within a single dataset.

However, other measures can be used to indicate **uncertainty**:

- **Standard Error (SE)**: Measures the variability of the **sample mean** rather than the spread of the data. It decreases as sample size increases, showing increased confidence in the mean. You can calculate it as `SE = SD / sqrt(n)` where `n` is the sample size.
  
- **Confidence Intervals (CI)**: Show a range of values within which the true mean is likely to fall, providing a clearer view of **mean reliability**. Refer to the lecture for more details on how to calculate CI.

- **Other Measures**: Depending on your data, you might choose **percentiles**, **prediction intervals**, or **bootstrap estimates**.
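
As a rough illustration of how these measures relate, the sketch below computes SD, SE, and an approximate 95% confidence interval for a made-up sample. The CI here assumes the normal approximation (mean ± 1.96 × SE), which is my simplification and may differ from the method covered in the lecture.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=10, scale=2, size=100)  # hypothetical measurements

n = sample.size
mean = sample.mean()
sd = sample.std(ddof=1)   # sample standard deviation (spread of the data)
se = sd / np.sqrt(n)      # standard error (uncertainty of the mean)

# Approximate 95% CI using the normal critical value 1.96 (an assumption)
ci_low = mean - 1.96 * se
ci_high = mean + 1.96 * se

print(f"mean = {mean:.2f}, SD = {sd:.2f}, SE = {se:.2f}")
print(f"approx. 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

With 100 data points, SE is ten times smaller than SD, so error bars based on SE or CI will look much tighter than error bars based on SD for the same data.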

**Choosing the Right Measure**

The choice of measure should match your **data context** and the **message** you want to convey:

- Use **Standard Deviation** to show **data spread**.
- Use **Standard Error** or **Confidence Intervals** to show **uncertainty around a mean**.

**Key Tips:**

1. **Always Label Your Error Bars** — Specify if you’re using SD, SE, or CI in your legend or title.
2. **Choose the Right Measure** — Decide if you want to show data spread or uncertainty.
3. **Avoid Misinterpretation** — Using the wrong measure can mislead your audience.

The following image illustrates the differences between **spread** (variability within the sample) and **error** (variability of the sample mean):

<img src="https://mjcowley.github.io/images/spread_error.png" alt="Spread vs Error" width="600"/>

1. **Top Panel – Population Distribution:**
   - Shows the **population mean** and **standard deviation (SD)**, representing the spread of the entire population.
   - The SD tells us how much the individual data points are likely to vary from the mean.

2. **Middle Panel – Sample Distribution:**
   - Displays a random sample taken from the population.
   - The **sample mean** is different from the true population mean due to sampling variability.
   - The **sample standard deviation** is used here to estimate the spread of the sample values.

3. **Bottom Panel – Sampling Distribution of the Mean:**
   - This panel highlights **standard error (SE)**.
   - SE measures the variability of the **sample mean** if we repeatedly took samples from the population.
   - As the sample size increases, SE decreases, showing increased confidence in our mean estimate.
   - In contrast to SD, SE reflects **uncertainty** in the estimate of the mean rather than the spread of individual data points.
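
The idea in the bottom panel can be checked with a small simulation. This is a sketch under assumed values (a normal population with SD of 2, and 2000 repeated samples per sample size): for each sample size, it compares the observed spread of the sample means with the formula `SD / sqrt(n)`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
population_sd = 2.0  # assumed population standard deviation

results = {}
for n in (10, 100, 1000):
    # Draw 2000 samples of size n and take the mean of each one
    sample_means = rng.normal(loc=0, scale=population_sd, size=(2000, n)).mean(axis=1)
    observed_se = sample_means.std(ddof=1)    # spread of the sample means
    expected_se = population_sd / np.sqrt(n)  # SE formula
    results[n] = (observed_se, expected_se)
    print(f"n = {n:4d}: observed SE = {observed_se:.3f}, SD/sqrt(n) = {expected_se:.3f}")
```

The observed and expected values should agree closely, and both shrink as the sample size grows, while the population SD itself stays fixed.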

By understanding these differences, you can make better decisions about which measure to use when plotting your data and effectively communicate your findings. Let's end with an exercise.



<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    Exercise: Visualising Uncertainty and Spread
  </h1>

In this activity, you will create a line plot of global temperature anomalies over time. You will also calculate a rolling standard deviation to show how temperature variability changes over the years. Let's make use of our global temperature dataset to visualise these trends. Here's what you need to do:

1. **Read the Data:**
   - Use the provided CSV file (`global_temp_anomalies.csv`), which contains two columns:
     - `Year`: The year of observation.
     - `Anomaly`: The global temperature anomaly for that year.

2. **Create a Line Plot:**
   - Plot a line graph of the `Anomaly` values against the `Year` to visualise the trend in temperature anomalies over time. Michael showed an example of this in an earlier lecture.

3. **Calculate a Rolling Standard Deviation:**
   - Use a rolling window (e.g., 10 years) to calculate the standard deviation of the anomalies. You can do this using the `rolling()` function in `pandas`. I provide it here for you: `rolling_std = data['Anomaly'].rolling(window=10).std()`.
   - Add this rolling standard deviation as a shaded region around the line plot to show how variability has changed over time.

4. **Customise the Plot:**
   - Add labels, a title, and a legend to make your visualisation more informative.
   
5. **Interpret the Visualisation:**
   - What does the trend in the anomalies tell you about global temperature changes?
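
If you have not used `rolling()` before, this tiny sketch shows how it behaves on a made-up series (not the temperature data). The values and the window of 3 are chosen only to keep the output short.

```python
import pandas as pd

toy = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])  # hypothetical values for illustration

# Each result uses the current value and the two before it
rolling_std = toy.rolling(window=3).std()

print(rolling_std)
```

The first two results are `NaN` because a full window of three values is not yet available. With a 10-year window on the real data, expect the shaded region to be missing for the first nine years.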

<div class="list-group" id="list-tab" role="tablist" style="text-align: center;">
  <h1 class="list-group-item list-group-item-action active" data-toggle="list" style="background: white; color: black; border: 0; padding: 20px 0; display: inline-block; width: 90%; box-sizing: border-box;">
    End of Exercise
  </h1>

# Further Exploration and Practice

Congratulations on completing your introduction to Python in Google Colab! You've made a significant first step towards mastering Python for data analysis and visualisation. To bolster your understanding and prepare for what's ahead, consider exploring the following resources:

- **Codecademy's Python Course:** An interactive platform for learning Python with a hands-on approach. Perfect for reinforcing what you've learned and building on it. [Codecademy's Python Course](https://www.codecademy.com/learn/learn-python-3).
- **Kaggle's Python Course:** Focused on data science applications, this course is ideal for those looking to delve into data manipulation and analysis. [Kaggle's Python Course](https://www.kaggle.com/learn/python).
- **Real Python:** Provides a wealth of tutorials and exercises, beneficial for all levels of Python developers. [Real Python](https://realpython.com/).
- **Python Official Documentation:** For in-depth learning, nothing beats the [Python official documentation](https://docs.python.org/3/). Use it to clarify concepts and learn about new features of the language.
- **Towards Data Science - Data Cleaning:** This guide offers practical tips and examples for cleaning your data in Python. [Towards Data Science - Data Cleaning](https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4)
- **Kaggle Data Cleaning Challenge:** Kaggle's micro-courses and challenges are great for hands-on practice in data cleaning and manipulation. [Kaggle Data Cleaning Challenge](https://www.kaggle.com/learn/data-cleaning)
- **Pandas Documentation:** Deep dive into the Pandas documentation to understand the full capabilities of this powerful library. [Pandas Documentation](https://pandas.pydata.org/docs/)
- **Data Science Central - Data Preparation:** Explore articles and tutorials that discuss various aspects of data preparation and best practices. [Data Science Central - Data Preparation](https://www.datasciencecentral.com/profiles/blogs/data-preparation)
- **Python Graph Gallery: Visualisation Catalogue:** This gallery provides a wide range of visualisation examples and code snippets for inspiration. [Python Graph Gallery](https://www.python-graph-gallery.com/)

Keep practising and experimenting with code to solidify your skills. Happy coding!
