# Working with External Data

CSV, or Comma-Separated Values, is one of the most common formats for sharing external datasets.

In [2]:
import os # Using os.path.join b/c Windows uses a \ to separate directories
import pandas as pd

In [3]:
users_file_name = os.path.join('data', 'users.csv')
users_file_name

'data/users.csv'

In [5]:
# Opens the file and prints out the first 5 lines
with open(users_file_name) as lines:
    for _ in range(5):
        # The `file` object is an iterator, so just get the next line
        print(next(lines))

,first_name,last_name,email,email_verified,signup_date,referral_count,balance

aaron,Aaron,Davis,aaron6348@gmail.com,TRUE,8/31/18,6,18.14

acook,Anthony,Cook,cook@gmail.com,TRUE,5/12/18,2,55.45

adam.saunders,Adam,Saunders,adam@gmail.com,FALSE,5/29/18,3,72.12

adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,TRUE,4/28/18,3,30.01



## Exercise One - Creation
Notice that the first line is a header row. I ignored this in my first round of working with the Invs. Institute's data, and Subrat fixed it when he replicated my DataFrame in pandas.<br>

By default, pandas assumes that the first row is the header row. To create a new DataFrame to explore the data (i.e., to replicate my data in an Excel-like environment), I use the `pandas.read_csv()` function along with the `index_col=` argument, which chooses the column that becomes the row index.

In [8]:
# Creates a new DataFrame and sets the index to the first column.
users = pd.read_csv(users_file_name, index_col=0)

To explore the DataFrame, I can use the `DataFrame.head()` method, which returns the first `n` rows. By default, `head()` returns 5 rows. I can pass the number I want as the first argument, like so: `variable_DataFrame_name.head(10)`.
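A minimal sketch of loading and previewing, using a made-up three-row CSV held in memory instead of `data/users.csv`. The sample rows and the names `sample_csv`, `sample`, and `first_two` are my own assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/users.csv: the first column has no header name
sample_csv = io.StringIO(
    ",first_name,balance\n"
    "aaron,Aaron,18.14\n"
    "acook,Anthony,55.45\n"
    "adrian,Adrian,30.01\n"
)

# index_col=0 turns the first column into the row labels (the index)
sample = pd.read_csv(sample_csv, index_col=0)

# head(2) returns only the first two rows; head() with no argument returns five
first_two = sample.head(2)

print(first_two)
print(list(sample.index))
```

Without `index_col=0`, the usernames would stay as an ordinary column and pandas would number the rows 0, 1, 2 instead.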

In [9]:
users.head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aaron,Aaron,Davis,aaron6348@gmail.com,True,8/31/18,6,18.14
acook,Anthony,Cook,cook@gmail.com,True,5/12/18,2,55.45
adam.saunders,Adam,Saunders,adam@gmail.com,False,5/29/18,3,72.12
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,4/28/18,3,30.01
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,6/16/18,7,25.85


## Exercise Two - Exploration 
To explore data, I can start with three types of information: 1) shape information, 2) data information, and 3) sorting.<br>

Always __remember__: I chain `.head()` onto the end, as in `variable_name.optional_method(its_parameters).head()`, to __see__ the data itself __after__ I've done something to it.

### Shape Information 
* `variable_name.shape` property
* `variable_name.count()` method (good way to find missing data) 
    * Missing data will show up as `NaN` (`np.nan`)
    * Missing data __will not__ be counted
* `variable_name.dtypes` property 
    * strings show up as 'object'
    * bool shows up as 'bool'
    * numbers show up as 'int64' (integers), 'float64' (decimals), or the like
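Here is a small sketch of how the three shape tools fit together when a value is missing. The table `demo` and its rows are hypothetical; only the method and property names come from the notes above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature users table with one missing last name
demo = pd.DataFrame(
    {
        "first_name": ["Aaron", "Brooke", "Sara"],
        "last_name": ["Davis", np.nan, "Boyer"],
        "referral_count": [6, 0, 0],
        "balance": [18.14, 7.22, 91.41],
    },
    index=["aaron", "brooke2027", "boyer7005"],
)

print(demo.shape)    # (number of rows, number of columns)
print(demo.count())  # non-missing values in each column

# Rows minus non-missing values gives the number of missing values per column
missing = len(demo) - demo.count()
print(missing)

print(demo.dtypes)   # one dtype per column
```

Comparing `count()` against the row count from `shape` is how the 430 versus 475 gap in `last_name` shows up in the real data.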

In [19]:
users.shape

(475, 7)

In [20]:
users.count()

first_name        475
last_name         430
email             475
email_verified    475
signup_date       475
referral_count    475
balance           475
dtype: int64

In [21]:
users.dtypes

first_name         object
last_name          object
email              object
email_verified       bool
signup_date        object
referral_count      int64
balance           float64
dtype: object

### Data Information
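The `DataFrame_varName.describe()` method is a quick way to get a sense of all the numeric data in the DataFrame: for each numeric column it reports the count, mean, standard deviation, minimum, quartiles, and maximum.<br>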

I can also use the `varName.mean()` method to find the mean of every numeric column at once; I don't __need__ to specify a column name. TEST: Try seeing if I can pass in the name of a column as an argument. RESULT: Nope, but I can select the column first and then call the method, like so: `varName.specifier.mean()`. 
* I can also find the standard deviation of a column using `varName.std()`
* Find the minimum and maximum using `varName.min()` or `varName.max()`

__NOTE:__ `value_counts()` counts how often each value appears in a column. By default, the counts are sorted in descending order, so the most frequent values are at the top. Therefore, I can use `varName.specifier.value_counts().head()` to find the five most common values in the specifier column.
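A hedged sketch of the column-level aggregates on a made-up four-row table (`stats_demo` and its values are assumptions). One caveat I'm adding: newer pandas releases may raise a `TypeError` when `mean()` is called on a whole DataFrame that contains text columns, so the sketch passes `numeric_only=True`.

```python
import pandas as pd

# Hypothetical data; column names mirror the users table
stats_demo = pd.DataFrame(
    {
        "first_name": ["Mark", "David", "Mark", "Sara"],
        "email_verified": [True, False, True, True],
        "referral_count": [6, 0, 2, 4],
    }
)

# Mean of a single column; for a bool column this is the share of True values
verified_share = stats_demo["email_verified"].mean()

# Mean of every numeric (and bool) column, skipping the text column
column_means = stats_demo.mean(numeric_only=True)

# Counts per distinct value, most frequent first; head(2) keeps the top two
top_names = stats_demo["first_name"].value_counts().head(2)

print(verified_share)
print(column_means)
print(top_names)
```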

In [22]:
users.describe()

Unnamed: 0,referral_count,balance
count,475.0,475.0
mean,3.429474,49.933263
std,2.281085,28.280448
min,0.0,0.05
25%,2.0,25.305
50%,3.0,51.57
75%,5.0,74.48
max,7.0,99.9


In [23]:
users.mean()

email_verified     0.818947
referral_count     3.429474
balance           49.933263
dtype: float64

In [25]:
# Test to see if I can find a single column's mean
users.email_verified.mean()

0.8189473684210526

In [26]:
users.std()

email_verified     0.385468
referral_count     2.281085
balance           28.280448
dtype: float64

In [27]:
users.max()

first_name                Zachary
email             zneal@gmail.com
email_verified               True
signup_date                9/9/18
referral_count                  7
balance                      99.9
dtype: object

In [28]:
users.min()

first_name                       Aaron
email             aalvarez@hotmail.com
email_verified                   False
signup_date                     1/1/18
referral_count                       0
balance                           0.05
dtype: object

In [29]:
users.email_verified.value_counts()

True     389
False     86
Name: email_verified, dtype: int64

In [30]:
users.first_name.value_counts().head()

Mark           11
David          10
Michael         9
Christopher     7
William         7
Name: first_name, dtype: int64

### Sorting Information
This is done through the `.sort_values()` method, whose behavior depends on the parameters I pass into it. I cover the parameters in the next section.

* Sort by index: `.sort_index()` (the order this DataFrame starts in)
* Sort by values: `.sort_values()` (actually __returns a whole new DataFrame__ and leaves the original untouched) 

Because of this quirk - that it returns a new DataFrame (think of a new DataFrame like a new Excel document) - if I want to __permanently replace__ the original order with the sorted one, I pass `inplace=True` as an additional argument to `variable_name.sort_values()`.

#### Parameters
List:
* `by='column_name'` = selects the column to sort by. All the other columns will follow along. Passing a list of column names sorts by the first column, then by the next. 
* `ascending=False` = puts the highest values at the top (e.g., the largest balance). TEST: What happens if the column contains strings, and not numbers? Will it still work? RESULT: YES! It sorts reverse alphabetically - so Z's will appear at the top. 

__NOTE:__ The DEFAULT for `.sort_values()` is `ascending=True`, so the lowest values come first. Passing in `ascending=False` reverses the order.
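To check the sort direction for myself, here is a minimal sketch on a hypothetical three-row table (`sort_demo` and its rows are made up). It sorts the same column with and without `ascending=False`, and confirms the original DataFrame is left alone.

```python
import pandas as pd

# Hypothetical rows, deliberately not in balance order
sort_demo = pd.DataFrame(
    {
        "email": ["king@yahoo.com", "zneal@gmail.com", "adams@hotmail.com"],
        "balance": [98.80, 39.90, 67.02],
    },
    index=["king", "zachary.neal", "darlene.adams"],
)

# No ascending argument: lowest balance first
low_to_high = sort_demo.sort_values(by="balance")

# ascending=False: highest balance first
high_to_low = sort_demo.sort_values(by="balance", ascending=False)

# Strings work too: ascending=False gives reverse alphabetical order
z_to_a = sort_demo.sort_values(by="email", ascending=False)

print(list(low_to_high.index))
print(list(high_to_low.index))
print(list(z_to_a.index))
print(list(sort_demo.index))  # unchanged, because each sort returned a new DataFrame
```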

In [10]:
users.sort_values(by='balance', ascending=False).head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
twhite,Timothy,White,white5136@hotmail.com,True,7/6/18,5,99.9
karen.snow,Karen,Snow,ksnow@yahoo.com,True,5/6/18,2,99.38
king,Billy,King,billy.king@hotmail.com,True,5/29/18,4,98.8
king3246,Brittney,King,brittney@yahoo.com,True,4/15/18,6,98.79
crane203,Valerie,Crane,valerie7051@hotmail.com,True,5/12/18,3,98.69


In [11]:
# Control
users.head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aaron,Aaron,Davis,aaron6348@gmail.com,True,8/31/18,6,18.14
acook,Anthony,Cook,cook@gmail.com,True,5/12/18,2,55.45
adam.saunders,Adam,Saunders,adam@gmail.com,False,5/29/18,3,72.12
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,4/28/18,3,30.01
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,6/16/18,7,25.85


In [14]:
# Test to see if .sort_values() works on strings
users.sort_values(by='email', ascending=False).head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
zachary.neal,Zachary,Neal,zneal@gmail.com,True,7/26/18,1,39.9
moore,Yvonne,Moore,yvonne5907@gmail.com,True,7/10/18,5,52.46
harold,Harold,Willis,willis9328@yahoo.com,False,5/7/18,1,6.0
nancy,Nancy,Williams,williams8331@gmail.com,True,5/14/18,4,41.15
daniel.williams,Daniel,Williams,williams6378@yahoo.com,True,3/13/18,4,93.85


In [15]:
# Test to see if .sort_values() works on booleans
users.sort_values(by='email_verified', ascending=False).head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aaron,Aaron,Davis,aaron6348@gmail.com,True,8/31/18,6,18.14
mays,Jacob,Mays,jmays@yahoo.com,True,3/1/18,4,83.85
matthew5486,Matthew,Weeks,matthew@hotmail.com,True,4/23/18,5,35.25
massey2102,Philip,Massey,philip@gmail.com,True,4/28/18,4,25.04
martin,Caroline,Martin,caroline@hotmail.com,True,8/30/18,3,16.55


In [32]:
# This command sorts first by last name, then by first name
users.sort_values(by=['last_name', 'first_name']).head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
darlene.adams,Darlene,Adams,adams@hotmail.com,True,9/15/18,2,67.02
lauren,Lauren,Aguilar,lauren.aguilar@summers.com,False,5/31/18,4,69.9
daniel,Daniel,Allen,allen@hotmail.com,False,7/1/18,2,21.21
kallen,Kathy,Allen,kathy@hotmail.com,False,2/20/18,1,43.72
alvarado,Denise,Alvarado,alvarado@hotmail.com,True,9/7/18,6,26.72


If I ever permanently change the DataFrame (i.e. save the new Excel sheet), I can switch it back to its original index order by entering `variable_name.sort_index(inplace=True)`, and as always, I can then call `.head()` to view it.
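A small sketch of the round trip, assuming a hypothetical table named `ledger` whose rows start out in index order (as the users data appears to):

```python
import pandas as pd

# Hypothetical table that starts out ordered by its index labels
ledger = pd.DataFrame(
    {"balance": [18.14, 55.45, 30.01]},
    index=["aaron", "acook", "adrian"],
)

# inplace=True reorders ledger itself instead of returning a new DataFrame
ledger.sort_values(by="balance", ascending=False, inplace=True)
order_after_sort = list(ledger.index)

# sort_index puts the rows back in index-label order
ledger.sort_index(inplace=True)
order_restored = list(ledger.index)

print(order_after_sort)
print(order_restored)
```

This only restores the starting order when the data began sorted by its index; if it didn't, `sort_index` would produce index order rather than the original order.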

## Exercise Three - Selecting Data
The point of doing this is to select data __that meets certain criteria__. This uses the `.loc` property, which accesses a group of rows and columns by their labels or by a boolean array.<br>

Allowed inputs are:
* A single label - e.g. '5' or 'a'
* A list or array of labels in the form: ['a','b','c']
* A slice object with labels in the form: ['a':'f'] __NOTE: Unlike normal Python slices, this includes both endpoints__, so a and f are __both__ included. 
* A callable function with one argument

A big part of manipulating data this way is that I can store a selection in a variable and then work with that variable, e.g., storing a boolean Series and using it to pull a __new DataFrame__ out of the original. See the documentation for more examples - it's a bit of an abstraction. 
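The allowed inputs listed above can be sketched one by one on a small hypothetical table (`loc_demo` and the result names are my own; the rows borrow values from the users data):

```python
import pandas as pd

# Hypothetical four-row table indexed by username
loc_demo = pd.DataFrame(
    {"referral_count": [6, 2, 3, 0], "balance": [18.14, 55.45, 72.12, 56.09]},
    index=["aaron", "acook", "adam.saunders", "alan9443"],
)

# Single label: returns that row as a Series
one_row = loc_demo.loc["acook"]

# List of labels: returns a DataFrame with those rows, in the order listed
two_rows = loc_demo.loc[["aaron", "alan9443"]]

# Label slice: both 'aaron' and 'adam.saunders' are included
label_slice = loc_demo.loc["aaron":"adam.saunders"]

# Callable: receives the DataFrame and returns a boolean Series to filter by
via_callable = loc_demo.loc[lambda df: df["balance"] > 50]

print(one_row)
print(two_rows)
print(label_slice)
print(via_callable)
```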

In [33]:
# I want to return a new Series that I can manipulate later
# Putting it in a variable lets me do this. 
no_referrals_index = users['referral_count'] < 1
# The boolean Series that comes back has one entry for every row in the DataFrame.
# Each value is the result of the comparison for that row
# (True where referral_count < 1, False otherwise)
no_referrals_index.head()

aaron            False
acook            False
adam.saunders    False
adrian           False
adrian.blair     False
Name: referral_count, dtype: bool

In [34]:
# Now, if I want to retrieve all the rows where the comparison was true
#    all I have to do is use my newly created variable as my index :) 
users[no_referrals_index].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
alan9443,Alan,Pope,pope@hotmail.com,True,4/17/18,0,56.09
andrew.alvarez,Andrew,Alvarez,aalvarez@hotmail.com,False,8/1/18,0,81.66
boyer7005,Sara,Boyer,boyer8636@gmail.com,True,7/31/18,0,91.41
brandon.gilbert,Brandon,Gilbert,brandon.gilbert@hotmail.com,True,4/28/18,0,10.17
brooke2027,Brooke,,brooke6938@gmail.com,False,5/23/18,0,7.22


### Inverse Indexing
A shortcut to invert a boolean index is to prefix it with a `~`. 

In [35]:
~no_referrals_index.head()

aaron            True
acook            True
adam.saunders    True
adrian           True
adrian.blair     True
Name: referral_count, dtype: bool

In [36]:
# Now I can use the shortcut to find where referral values DO NOT = 0
users[~no_referrals_index].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aaron,Aaron,Davis,aaron6348@gmail.com,True,8/31/18,6,18.14
acook,Anthony,Cook,cook@gmail.com,True,5/12/18,2,55.45
adam.saunders,Adam,Saunders,adam@gmail.com,False,5/29/18,3,72.12
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,4/28/18,3,30.01
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,6/16/18,7,25.85


### .loc Indexing
Basically, the `.loc` property allows me to select exact rows and columns out of the DataFrame/Excel sheet I've created. 

In [37]:
# Selects rows where there are no referrals
# AND selects only the following columns, in the order listed
users.loc[no_referrals_index, ['balance', 'email']].head()
# Note that ['col_name', 'col_name'] here only picks which columns appear and in
#    what order; it does not sort the rows (that is what sort_values is for)

Unnamed: 0,balance,email
alan9443,56.09,pope@hotmail.com
andrew.alvarez,81.66,aalvarez@hotmail.com
boyer7005,91.41,boyer8636@gmail.com
brandon.gilbert,10.17,brandon.gilbert@hotmail.com
brooke2027,7.22,brooke6938@gmail.com


In [38]:
# It's also possible to do the comparison inline, without storing the index
#    in a variable
users[users['referral_count'] == 0].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
alan9443,Alan,Pope,pope@hotmail.com,True,4/17/18,0,56.09
andrew.alvarez,Andrew,Alvarez,aalvarez@hotmail.com,False,8/1/18,0,81.66
boyer7005,Sara,Boyer,boyer8636@gmail.com,True,7/31/18,0,91.41
brandon.gilbert,Brandon,Gilbert,brandon.gilbert@hotmail.com,True,4/28/18,0,10.17
brooke2027,Brooke,,brooke6938@gmail.com,False,5/23/18,0,7.22


In [39]:
# I can also use bitwise operators (& for AND) to combine one boolean Series
#     with another boolean Series. Each comparison needs its own parentheses.
# Select all users where they haven't made a referral AND their email
#     has been verified.
users[(users['referral_count'] == 0) & (users['email_verified'] == True)].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
alan9443,Alan,Pope,pope@hotmail.com,True,4/17/18,0,56.09
boyer7005,Sara,Boyer,boyer8636@gmail.com,True,7/31/18,0,91.41
brandon.gilbert,Brandon,Gilbert,brandon.gilbert@hotmail.com,True,4/28/18,0,10.17
bryant,Darlene,Bryant,dbryant@yahoo.com,True,7/19/18,0,36.91
calvin.perez,Calvin,Perez,cperez@gmail.com,True,2/17/18,0,13.01


In [40]:
users[(users['referral_count'] > 5) & (users['email_verified'] == True)].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aaron,Aaron,Davis,aaron6348@gmail.com,True,8/31/18,6,18.14
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,6/16/18,7,25.85
alvarado,Denise,Alvarado,alvarado@hotmail.com,True,9/7/18,6,26.72
alvarez,John,Alvarez,john4346@hotmail.com,True,9/18/18,6,49.62
angela7209,Angela,Collins,collins5797@yahoo.com,True,5/5/18,7,29.52
