# Getting Data from a CSV File

Open this notebook in [Callysto](https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https://github.com/pbeens/Data-Dunkers&branch=main&subPath=Demos/where-can-we-get-data-from-csv.ipynb&depth=1) | [Colab](https://githubtocolab.com/pbeens/Data-Dunkers/blob/main/Demos/where-can-we-get-data-from-csv.ipynb).

## Program Setup

This first code block may need to be run if these libraries haven't been installed yet. Once they are installed in an environment, you normally won't have to install them there again. You can skip it for now, but if you get an error message about a library not being installed, go ahead and run it.

In [1]:
%pip install pandas -q
%pip install plotly -q

Note: you may need to restart the kernel to use updated packages.
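
If you are not sure whether the install cell is needed, one way to check is to ask Python whether it can find each library. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_installed` is made up for illustration.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# Report on the two libraries this demo uses
for name in ('pandas', 'plotly'):
    print(name, 'is installed' if is_installed(name) else 'is missing')
```

If either library is reported as missing, run the install cell above and restart the kernel if prompted.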



## Introduction

There are many ways we can import data, but the most common sources are the program itself, a CSV (comma-separated values) file, an Excel spreadsheet, a Google Sheet, or a webpage.

In this demo, we will show how to get data from a CSV file.

In our first example, we got our data from within the Jupyter Notebook itself. This method can be used but it is not very common.

A more common method is to get the data from *outside* the program, and the **CSV** file format is one of the most common ways to store it.

In this example program, we first import the **Pandas** library using `import pandas as pd` (we still need `plotly.express` so that's imported as well). We then use the `pd.read_csv()` function to read the [CSV file](https://raw.githubusercontent.com/pbeens/Data-Dunkers/main/Data/x-y-data.csv) into a **Pandas DataFrame**. 

Note that we are using a variable called `url` this time. Storing the address in a variable often makes the program easier to read.
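
To see what `pd.read_csv()` works with, here is a minimal sketch that reads CSV text from an in-memory string instead of a URL. The sample text is an assumption: a small made-up table with the same `X` and `Y` column names. `io.StringIO` simply lets Pandas treat the string as if it were a file.

```python
import io
import pandas as pd

# Assumed sample: the first line holds the column names,
# and every following line is one row of comma-separated values
csv_text = """X,Y
0,0
1,1
2,4
3,9
4,16
5,25
"""

# StringIO wraps the string so read_csv can read it like a file
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)  # (rows, columns)
```

Reading from a URL or a local file path works the same way: only the first argument to `pd.read_csv()` changes.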

## Setup & Input

In [2]:
# import plotly.express and pandas
import plotly.express as px
import pandas as pd

# Read the CSV file into a DataFrame named df
url = 'https://raw.githubusercontent.com/pbeens/Data-Dunkers/main/Data/x-y-data.csv'
df = pd.read_csv(url)

## Process

Just for fun, let's look at the top few rows of the data we just imported. We use the Pandas `head()` method for this:

In [3]:
# Display the first 5 rows of the data
print(df.head())

   X   Y
0  0   0
1  1   1
2  2   4
3  3   9
4  4  16


What about the bottom rows? For those we use `tail()`. (Let's look at only the bottom 2 rows.)

In [4]:
# Display the last 2 rows of the data
print(df.tail(2))

   X   Y
4  4  16
5  5  25


You'll see that Pandas has inserted an index column before the data. We won't worry about that at this time because it won't affect us here.
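
Both `head()` and `tail()` accept an optional number of rows, and `shape` reports the size of the whole table. The sketch below uses a stand-in DataFrame built in code (an assumption, so that it runs without downloading anything) with the same `X` and `Y` values as the output above.

```python
import pandas as pd

# Assumed stand-in for the CSV data: X from 0 to 5 and Y as X squared
sample_df = pd.DataFrame({'X': range(6), 'Y': [x ** 2 for x in range(6)]})

print(sample_df.head(3))  # first 3 rows instead of the default 5
print(sample_df.tail(2))  # last 2 rows
print(sample_df.shape)    # (number of rows, number of columns)
```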

Besides using `head()` to take a quick look at the data, data scientists also often check which columns are included in the data file. To do that, we use the `df.columns` attribute. Here's how:

In [5]:
print(df.columns)

Index(['X', 'Y'], dtype='object')


It tells us there are two columns: 'X' and 'Y'. The case of the letters is important, so always pay attention to that. 
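
As a quick illustration of why case matters, this sketch checks for a column name in both upper and lower case. The small DataFrame is made up for the example.

```python
import pandas as pd

# Made-up data with the same column names as the CSV file
sample_df = pd.DataFrame({'X': [0, 1, 2], 'Y': [0, 1, 4]})

has_upper = 'X' in sample_df.columns  # exact match
has_lower = 'x' in sample_df.columns  # different case, so no match
print(has_upper, has_lower)

# Asking for a column that does not exist raises a KeyError
try:
    sample_df['x']
except KeyError as err:
    print('No such column:', err)
```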

## Output

And now let's plot it. Notice that this is the exact same code as when we plotted the [internal data](where-can-we-get-data-from-internal.ipynb).

In [6]:
# Create the plot
fig = px.line(data_frame=df, 
    x='X', 
    y='Y', 
    title='Data from a CSV file')

# Show the plot
fig.show()

The important difference from using internal list data is that we have to identify the DataFrame we want to use before giving the names of the columns we want to plot:

> `data_frame=df, x='X', y='Y'`

Putting it all together, we have:

In [7]:
# Setup
import plotly.express as px
import pandas as pd

url = 'https://raw.githubusercontent.com/pbeens/Data-Dunkers/main/Data/x-y-data.csv'

# Input
df = pd.read_csv(url)

# Process
fig = px.line(data_frame=df, 
    x='X', 
    y='Y', 
    title='Data from a CSV file')

# Output
fig.show()

**What if we want to change the names of the columns?**

If there are just a few columns you can simply reassign them like this:

In [8]:
df.columns = ['X Value', 'X^2']
display(df.columns)

Index(['X Value', 'X^2'], dtype='object')
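
One thing to watch with this approach is that the list must contain exactly one name per column. The sketch below, using a made-up two-column table, shows that Pandas raises a `ValueError` when the counts don't match.

```python
import pandas as pd

# Made-up two-column table
table = pd.DataFrame({'X': [0, 1, 2], 'Y': [0, 1, 4]})

try:
    table.columns = ['X Value']  # only one name for two columns
except ValueError as err:
    print('Could not rename:', err)

# One name per column works
table.columns = ['X Value', 'X^2']
print(list(table.columns))
```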

If there are lots of columns and you only want to rename a few of them, you can use this technique, which uses Python [*dictionaries*](https://www.w3schools.com/python/python_dictionaries.asp):

In [9]:
# Older style - changing the DataFrame in place with inplace=True is generally discouraged
# df.rename(columns={'X Value': 'X', 
#                    'X^2': 'Y'}, 
#                     inplace=True)

# Preferred method
df = df.rename(columns={'X Value': 'X', 
                        'X^2': 'Y'}) 

df.columns

Index(['X', 'Y'], dtype='object')
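
To illustrate the dictionary approach on a wider table, here is a small sketch with a hypothetical three-column DataFrame. Only the column named in the dictionary is renamed, and `rename()` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Hypothetical three-column table, made up for illustration
scores = pd.DataFrame({'X Value': [0, 1, 2],
                       'X^2': [0, 1, 4],
                       'Note': ['a', 'b', 'c']})

# Only the column listed in the dictionary changes
renamed = scores.rename(columns={'X^2': 'Y'})

print(list(scores.columns))   # the original keeps its old names
print(list(renamed.columns))  # the copy has the new name
```

This is why the preferred method above assigns the result back to `df`.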

**Assignment**

Put the whole program together in the cell below, including renaming the columns as shown above. The only output should be the plot.

---
In our next demonstration we will get our data from an [Excel](where-can-we-get-data-from-excel.ipynb) file. ([GitHub link](https://github.com/pbeens/Data-Dunkers/blob/main/Demos/where-can-we-get-data-from-excel.ipynb))