
---
# Lambda School Data Science - Intro to Pandas 
---
# Lecture 05 - Loading Data 
---



### Begin by importing your tools

In [None]:
# Let's begin by importing pandas
import pandas as pd

In [None]:
# help(pd.read_csv)

In [None]:
?pd.read_csv

## Reading in Your Data
Pandas has functions that read different types of data files and turn them into a DataFrame. In practice, you could be pulling data from different databases, different websites, your local computer, or a combination of all of these. For our purposes in the pre-course, we'll focus on the most commonly used one:  
**`pandas.read_csv()`**

CSV stands for "comma-separated values." If you've ever used Microsoft Excel or Google Sheets, you've worked with the same kind of tabular data; both can open CSV files and export to them. Keep this link to the documentation handy. Learning how to read documentation will be invaluable on your journey.
[Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
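To see what `read_csv` does with the raw text, here is a minimal sketch that parses a tiny CSV held in memory instead of a file. The data and the names `csv_text` and `toy_df` are made up for illustration; they are not part of the lecture dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text, invented for this illustration.
csv_text = """Name,Region,Sales
Ada,North,100
Grace,South,250
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)  # (number of rows, number of columns)
print(toy_df)
```

The first line of the text becomes the column names, and each remaining line becomes one row.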

### Load a Dataset via its URL

In [None]:
# We use the pandas method pandas.read_csv("filepath") to create a DataFrame 
# and assign it to a variable:

df = pd.read_csv("https://raw.githubusercontent.com/axrd/datasets/master/sales_data.csv")

# Why do you think we assigned the new DataFrame to a variable? 

print(df.shape)
df.head()

(100, 6)


| Unnamed: 0.1 | Unnamed: 0 | Name | Region | Company | Date | Sales |
|---|---|---|---|---|---|---|
| 0 | 0 | Brenden Cote | Central African Republic | Metus Corp. | Feb 17, 2018 | 67044 |
| 1 | 1 | Justina Reed | Namibia | Lobortis Ltd | Apr 27, 2017 | 89517 |
| 2 | 2 | Daquan Vinson | Svalbard and Jan Mayen Islands | A Mi Consulting | Aug 14, 2016 | 62705 |
| 3 | 3 | Connor Shelton | Niue | Parturient Consulting | Feb 13, 2017 | 12675 |
| 4 | 4 | Drew Carlson | Mayotte | Interdum Associates | Sep 26, 2017 | 86670 |


### Load a Dataset from your local machine

Take a look at where this dataset lives on GitHub: [sales_data.csv](https://github.com/axrd/datasets/blob/master/sales_data.csv). Download the file to your computer, then run the cell below to upload it to your Colab session.

In [None]:
from google.colab import files
files.upload()

In [None]:
df = pd.read_csv('sales_data.csv')

print(df.shape)
df.head()
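Outside of Colab, the same call works on any path on disk. The sketch below is only an illustration under stated assumptions: it writes a small, made-up DataFrame to a temporary file (the name `toy_sales.csv` is hypothetical) and reads it back, so it runs without downloading anything.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, invented for this illustration.
toy_df = pd.DataFrame({"Name": ["Ada", "Grace"], "Sales": [100, 250]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a path to a file on the local disk.
    local_path = os.path.join(tmp_dir, "toy_sales.csv")

    # index=False keeps the row index out of the file.
    toy_df.to_csv(local_path, index=False)

    # Reading a local file looks exactly like reading from a URL.
    loaded_df = pd.read_csv(local_path)

print(loaded_df.shape)
print(loaded_df.head())
```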

## Heads or Tails?
The cell ran...but how do we know if we got our data?
Pandas has a few methods that can help us verify:  

`dataframe.head()`  

Shows us the ***first*** 5 rows of our DataFrame. We can pass an integer inside the parentheses to see a specific number of rows.

`dataframe.tail()`

Shows us the ***last*** 5 rows of our DataFrame. We can pass an integer inside the parentheses to see a specific number of rows.


In [None]:
# Look at the first 5 rows:
df.head()

In [None]:
# Look at the last 5 rows:

df.tail()

Hmm...it looks like we may have two indexes. We'll fix that a little later. 

Right now, notice that df.head() and df.tail() return the first five rows and the last five rows of the DataFrame, respectively. But what if I wanted to see the first ten rows?

In [None]:
# Pass 10 to see the first ten rows:
df.head(10)

The head() method takes a parameter, `n`, that tells it how many rows to return from the top. The tail() method takes the same parameter but counts from the bottom instead. 
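As a quick illustration of how the row count changes the result, here is a small sketch on a made-up 20-row DataFrame (`toy_df` and its `value` column are hypothetical, not the sales data):

```python
import pandas as pd

# Hypothetical 20-row DataFrame with values 0 through 19.
toy_df = pd.DataFrame({"value": range(20)})

default_head = toy_df.head()  # no argument: first 5 rows
first_three = toy_df.head(3)  # first 3 rows
last_two = toy_df.tail(2)     # last 2 rows

print(len(default_head))
print(first_three)
print(last_two)
```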

From our head() and tail() output we can see that we have people, regions, companies, dates, and sales numbers. Let's say we knew that this data was supposed to have 100 rows and 5 columns. How could we verify that we imported the data correctly?

In [None]:
# pandas DataFrames have the same useful attribute as NumPy ndarrays: shape.
df.shape

We already knew this, but we've got one extra column (the duplicate index). Don't worry, this can happen! Fortunately, data frames have a method to drop columns.
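The shape check can also be written as an explicit comparison. This is a minimal sketch on a made-up DataFrame; `toy_df` and `expected_shape` are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# shape is a tuple: (number of rows, number of columns).
n_rows, n_cols = toy_df.shape

expected_shape = (3, 2)
if toy_df.shape == expected_shape:
    print(f"Looks right: {n_rows} rows and {n_cols} columns")
else:
    print(f"Expected {expected_shape}, but got {toy_df.shape}")
```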


In [None]:
# Use the dataframe.columns attribute to get the column names (returned as a pandas Index):
df.columns

In [None]:
# If we just want a list of the column headers we can cast the index to a list
list(df.columns)
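To make the difference between the two results concrete, here is a small sketch on a made-up one-row DataFrame (all names and values are hypothetical):

```python
import pandas as pd

# Hypothetical DataFrame, invented for this illustration.
toy_df = pd.DataFrame({"Name": ["Ada"], "Region": ["North"], "Sales": [100]})

column_index = toy_df.columns       # a pandas Index object
column_list = list(toy_df.columns)  # a plain Python list of strings
first_column = toy_df.columns[0]    # positions start at 0

print(type(column_index))
print(column_list)
print(first_column)
```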

In [None]:
# We need to drop the first column. It has no real name (pandas labeled it 'Unnamed: 0'), so let's look it up by its position.
# Remember that index values begin at 0!

df.columns[0]

In [None]:
# Using the dataframe.drop() method:

# The following two lines of code are equivalent.
# (As written, either one raises a KeyError; the next section explains why.)
df.drop(df.columns[0])
# df.drop('Unnamed: 0')

### Axis? What's that all about?  
### This:
![atext](https://i.stack.imgur.com/dcoE3.jpg)  
If you check the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html), you'll see that DataFrame.drop() has `axis=0` as its default, which means it looks for the label among the rows. We need to explicitly (remember the Zen of Python?) tell pandas to look for the label we want to drop along the column axis, which is axis 1.
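The effect of `axis` can be sketched on a small, made-up DataFrame. The data below is hypothetical; the `columns=` keyword shown at the end is an alternative spelling of `axis=1` that `DataFrame.drop()` also accepts.

```python
import pandas as pd

# Hypothetical DataFrame with an unwanted first column.
toy_df = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "Name": ["Ada", "Grace", "Linus"],
        "Sales": [100, 250, 75],
    }
)

# axis=0 (the default) looks for the label among the row labels.
without_row = toy_df.drop(0, axis=0)

# axis=1 looks for the label among the column names.
without_col = toy_df.drop("Unnamed: 0", axis=1)

# columns= says the same thing without an axis number.
same_without_col = toy_df.drop(columns="Unnamed: 0")

# With the default axis, a column name is not found among the rows.
raised_key_error = False
try:
    toy_df.drop("Unnamed: 0")
except KeyError:
    raised_key_error = True

print(without_row.shape, without_col.shape, raised_key_error)
```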

In [None]:
# Second time's the charm!

df.drop(df.columns[0], axis=1)

In [None]:
# We did it! Wait... Why is it still there? 🤔

df.head()

### Not quite...
By default, most pandas methods perform the operation and return a changed copy of the DataFrame; the original DataFrame stays the same. To keep the change, we can either make it _in place_ or assign the returned copy back to `df`.

In [None]:
df.drop(df.columns[0], axis=1, inplace=True)

In [None]:
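# I suggest using this approach: assign the result back to df.
# Dropping by name with errors="ignore" keeps this cell safe to run even if
# the in-place cell above already removed the column.
df = df.drop(columns="Unnamed: 0", errors="ignore")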

In [None]:
# Third time's the charm?
df.head()
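The copy-versus-original behavior can be checked directly. This is a minimal sketch on a made-up DataFrame (`toy_df` and its contents are hypothetical):

```python
import pandas as pd

# Hypothetical DataFrame with an unwanted first column.
toy_df = pd.DataFrame({"Unnamed: 0": [0, 1], "Name": ["Ada", "Grace"]})

# drop() returns a new DataFrame; toy_df itself is not modified.
returned = toy_df.drop(columns="Unnamed: 0")
original_columns_after_call = list(toy_df.columns)

# Assigning the result back to the same name keeps the change.
toy_df = toy_df.drop(columns="Unnamed: 0")

print(original_columns_after_call)  # still includes the unwanted column
print(list(toy_df.columns))         # unwanted column is gone
```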

## Renaming a Column
On closer inspection, the column "Region" should really be called "Country." Let's make that happen, _in place_.

In [None]:
df.rename(columns={'Region':'Country'}, inplace=True)

In [None]:
df.head()
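Like `drop()`, `rename()` returns a changed copy unless told otherwise. The sketch below uses a made-up DataFrame and also shows one thing to watch for: by default, a key that matches no column is silently ignored, so a typo in the old name leaves the columns unchanged. The misspelling `Regoin` is deliberate.

```python
import pandas as pd

# Hypothetical DataFrame, invented for this illustration.
toy_df = pd.DataFrame({"Name": ["Ada"], "Region": ["Niue"]})

# The dictionary maps old column names to new ones.
renamed = toy_df.rename(columns={"Region": "Country"})

# A misspelled key matches nothing, so nothing is renamed and no error is raised.
unchanged = toy_df.rename(columns={"Regoin": "Country"})

print(list(renamed.columns))
print(list(unchanged.columns))
```

Checking `df.columns` or `df.head()` after a rename is a cheap way to confirm that it did what you intended.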

### Let's get started on your assignment!