# Getting Data From a CSV File

Open this notebook in [Callysto](https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https://github.com/pbeens/Data-Dunkers&branch=main&subPath=Demos/data-from-csv.ipynb&depth=1) | [Colab](https://githubtocolab.com/pbeens/Data-Dunkers/blob/main/Demos/data-from-csv.ipynb).

# Lesson Objectives

By the end of this lesson, students will be able to:
- Utilize the Pandas library to load data from a CSV file into a DataFrame.
- Display the top and bottom rows of data using the `head()` and `tail()` functions in Pandas.
- Identify and manipulate column names within a DataFrame using the `columns` attribute.
- Create a line plot using Plotly Express by specifying data frames and column mappings.
- Recognize the importance of accurately specifying column names in data analysis and visualization tasks.

## Program Setup 

This first code block only needs to be run if these libraries haven't already been installed. Once they are installed in your environment, you shouldn't need to run it again. You can skip it for now, but if you get an error message about a library not being installed, go ahead and run it.

In [None]:
%pip install pandas -q
%pip install plotly -q

## Introduction

There are many ways we can import data, but the most common sources are the program itself, a CSV (comma-separated values) file, an Excel spreadsheet, a Google Sheet, or a webpage.

In this demo, we will demonstrate how to get data from a CSV file.

## Setup & Input

In our first example, we got our data from within the Jupyter Notebook itself. This method works, but it is not very common. It is more common to get the data from *outside* the program, and **CSV** is one of the most widely used file formats for this.

In this example program, we first import the **Pandas** library using `import pandas as pd` (we still need `plotly.express` so that's imported as well). We then use the `pd.read_csv()` function to read the [CSV file](https://raw.githubusercontent.com/pbeens/Data-Dunkers/main/Data/x-y-data.csv) into a **Pandas DataFrame**. 

Note that we are using a variable called `url` this time to hold the address of the file. This often makes the program easier to read.

In [None]:
# import plotly.express and pandas
import plotly.express as px
import pandas as pd

# Read the CSV file into a DataFrame named df
url = 'https://raw.githubusercontent.com/pbeens/Data-Dunkers/main/Data/x-y-data.csv'
df = pd.read_csv(url)
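
To see what `pd.read_csv()` does without depending on a network connection, here is a minimal sketch that reads CSV text from a string instead of a URL. The sample values and the names `csv_text` and `sample_df` are made up for illustration and are not the contents of the actual file; `io.StringIO` simply makes the string behave like a file.

```python
import io
import pandas as pd

# Hypothetical CSV text: a header row, then one data row per line
csv_text = """X,Y
1,1
2,4
3,9
"""

# io.StringIO wraps the string so read_csv can treat it like a file
sample_df = pd.read_csv(io.StringIO(csv_text))

# The header row becomes the column names; the other rows become the data
print(sample_df)
```

The same call works with a file path or a URL in place of the `io.StringIO` object, which is what the cell above does.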

## Process

Just for fun, let's look at the top few lines of the data we just loaded. We use the Pandas `head()` function for this:

In [None]:
# Display the first 5 rows of the data
print(df.head())

What about the bottom rows? We use `tail()` for that. Let's look at only the last 2 rows:

In [None]:
# Display the last 2 rows of the data
print(df.tail(2))

You'll see that Pandas has inserted an index column before the data. We won't worry about that at this time because it won't affect us here.
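
As a small sketch of how `head()`, `tail()`, and the index fit together, the example below builds a made-up ten-row DataFrame (the name `sample_df` and its values are assumptions for illustration, not the CSV data) and asks for a specific number of rows from each end.

```python
import pandas as pd

# Hypothetical data: 10 rows, where Y is X squared
sample_df = pd.DataFrame({'X': range(1, 11)})
sample_df['Y'] = sample_df['X'] ** 2

# Passing a number chooses how many rows to return
first_three = sample_df.head(3)   # rows with index 0, 1, 2
last_two = sample_df.tail(2)      # rows with index 8, 9

print(first_three)
print(last_two)

# The index labels the rows; it is not one of the data columns
print(list(last_two.index))
print(list(sample_df.columns))
```

Notice that the index starts at 0 and keeps its original values in the `tail()` output.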

Besides using `head()` to have a quick look at the data, data scientists also often check which columns the data file contains. To do that, we use the `df.columns` attribute. Here's how:

In [None]:
# Display the column names
print(df.columns)

It tells us there are two columns: 'X' and 'Y'. Column names are case-sensitive ('X' and 'x' are different names), so always pay attention to that.
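
To illustrate why the case of the letters matters, here is a short sketch using a made-up DataFrame (`sample_df` and its values are assumptions for illustration). It checks whether a name is among the columns before using it, and shows what happens when the case is wrong.

```python
import pandas as pd

# Hypothetical data with the same column names as the CSV file
sample_df = pd.DataFrame({'X': [1, 2, 3], 'Y': [1, 4, 9]})

# Convert the column names to a plain Python list
column_names = list(sample_df.columns)
print(column_names)

# 'X' is a column, but lowercase 'x' is not
has_upper = 'X' in sample_df.columns
has_lower = 'x' in sample_df.columns
print(has_upper, has_lower)

# Asking for a column that doesn't exist raises a KeyError
try:
    sample_df['x']
except KeyError:
    print("No column named 'x'")
```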

## Output

And now let's plot it. Notice that this is nearly the same code as when we plotted the [internal data](where-can-we-get-data-from-internal.ipynb).

In [None]:
# Create the plot
fig = px.line(data_frame=df, 
    x='X', 
    y='Y', 
    title='Data from a CSV file')

# Show the plot
fig.show()

The important difference from using internal list data is that we have to identify the DataFrame we want to use before giving the names of the columns we want to plot:

> data_frame=df, x='X', y='Y'

Putting it all together, we have:

In [None]:
# Setup
import plotly.express as px
import pandas as pd

# Input
url = 'https://raw.githubusercontent.com/pbeens/Data-Dunkers/main/Data/x-y-data.csv'
df = pd.read_csv(url)

# Process
fig = px.line(data_frame=df, 
    x='X', 
    y='Y', 
    title='Data from a CSV file')

# Output
fig.show()

**What if we want to change the names of the columns?**

If there are just a few columns, you can simply reassign all of them at once, giving one name per column in order:

In [None]:
# Changing column names
df.columns = ['X Value', 'X^2']
display(df.columns)

If there are lots of columns and you only want to rename a few of them, you can use this technique, which uses Python [*dictionaries*](https://www.w3schools.com/python/python_dictionaries.asp):

In [None]:
# Older style - inplace=True still works, but reassigning the result is generally preferred
# df.rename(columns={'X Value': 'X', 
#                    'X^2': 'Y'}, 
#                     inplace=True)

# Preferred method with lots of columns
df = df.rename(columns={'X Value': 'X', 
                        'X^2': 'Y'}) 

display(df.columns)
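
The sketch below shows the dictionary technique on a made-up DataFrame with three columns (the names `sample_df`, `Player`, `GP`, and `PTS` are hypothetical and not taken from the lesson's data). Only the column listed in the dictionary is renamed, and `rename()` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Hypothetical data with three columns
sample_df = pd.DataFrame({'Player': ['A', 'B'],
                          'GP': [10, 12],
                          'PTS': [150, 200]})

# The dictionary maps old names (keys) to new names (values)
renamed = sample_df.rename(columns={'GP': 'Games Played'})

# Only 'GP' changed; the other columns keep their names
print(list(renamed.columns))

# The original DataFrame is unchanged unless we reassign it
print(list(sample_df.columns))
```

This is why the cell above writes `df = df.rename(...)`: the reassignment is what keeps the new names.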

**Assignment**

Put the whole program together in the cell below, including renaming the columns as shown above. The only output should be the plot.

In [None]:
# Setup


# Input


# Process


# Output

# Exercise

Using the code above as an example, use the data below to plot Pascal Siakam's field goals made over his Raptors career. 

In [None]:
# Setup


# Input
url = 'https://raw.githubusercontent.com/pbeens/Data-Dunkers/main/Data/example.csv'


# Process


# Output


---
In our next demonstration we will get our data from an [Excel](https://github.com/pbeens/Data-Dunkers/blob/main/Demos/data-from-excel.ipynb) file.

---
*Report issues or give us feedback about this notebook [here](https://docs.google.com/forms/d/e/1FAIpQLSdMRX2hPqZyD8-argFJXxB3ABQdLk3aUH1CAfmMEtcFAlWzCw/viewform?usp=pp_url&entry.1771525592=Module%20Resources%20%28the%20Jupyter%20notebooks%2C%20PPTS%20or%20additional%20resources%29&entry.1364186163=Data%20From%20a%20CSV%20File).*

---
Back to [Lessons](https://github.com/pbeens/Data-Dunkers/blob/main/Lessons.ipynb)