# Data Manipulation using Pandas
DataFrames are *mutable*, which means that I can manipulate them. The cells below show various ways to alter the shape of a DataFrame and change the data in it.

In [3]:
from datetime import datetime
import os

import numpy as np
import pandas as pd

In [5]:
users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)
# check if everything worked:
(users.shape, transactions.shape)

((475, 7), (999, 4))

## Exercise One - Assigning Values
Chained indexing is __not__ the way to assign a value; instead, use the `.loc` indexer to locate the row and the specific column to update.<br>

See what I mean by chaining values below:
1. First, make sure there's only one entry matching the data I'd like to alter.

In [6]:
users[(users.first_name == "Adrian") & (users.last_name == "Fang")]

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,4/28/18,3,30.01


2. Then, I want to update the balance. The scenario I was given is that a user called to say there was an error in his balance: it should be 35 but is currently 30.01.
The common instinct is to chain off the DataFrame returned above, but when I try to set the new value that way, I get a `SettingWithCopyWarning`, and the assignment lands on a temporary copy instead of on `users`.

In [8]:
users[(users.first_name == "Adrian") & (users.last_name == "Fang")]['balance']

adrian    30.01
Name: balance, dtype: float64

In [9]:
users[(users.first_name == "Adrian") & (users.last_name == "Fang")]['balance'] = 35.00

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


### New Method - Don't Chain and Assign, Instead `.loc` it
Below, I locate the specific row and column using `.loc` and update it. 

In [10]:
users.loc[(users.first_name == "Adrian") & (users.last_name == "Fang"), 'balance'] = 35.00
users.loc['adrian']

first_name                               Adrian
last_name                                  Fang
email             adrian.fang@teamtreehouse.com
email_verified                             True
signup_date                             4/28/18
referral_count                                3
balance                                      35
Name: adrian, dtype: object
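
To see the difference side by side, here is a minimal sketch on a tiny made-up DataFrame (`demo_users` and its two rows are assumptions for illustration, not the real `users.csv`). The chained assignment writes to a temporary copy, so the stored balance does not change; the `.loc` assignment updates the DataFrame itself.

```python
import warnings

import pandas as pd

# Hypothetical stand-in for the users DataFrame (made-up rows)
demo_users = pd.DataFrame(
    {
        "first_name": ["Adrian", "Kimberly"],
        "last_name": ["Fang", "Deal"],
        "balance": [30.01, 54.73],
    },
    index=["adrian", "kimberly"],
)
mask = (demo_users.first_name == "Adrian") & (demo_users.last_name == "Fang")

# Chained assignment: demo_users[mask] is a copy, so the write is lost.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the chained-assignment warning in this demo
    demo_users[mask]["balance"] = 35.00
balance_after_chain = demo_users.at["adrian", "balance"]

# Single .loc call: row selector and column name together, so the write sticks.
demo_users.loc[mask, "balance"] = 35.00
balance_after_loc = demo_users.at["adrian", "balance"]

print(balance_after_chain, balance_after_loc)
```

After running this, `balance_after_chain` still holds the old value and `balance_after_loc` holds the new one.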

That was a bit complicated to type out, so when I already know the row label and the column name, I can use the `.at` accessor as a shortcut.

In [13]:
# remember, pandas indexes as [row label, column name]
users.at['adrian', 'balance'] = 45.00
users.loc['adrian']

first_name                               Adrian
last_name                                  Fang
email             adrian.fang@teamtreehouse.com
email_verified                             True
signup_date                             4/28/18
referral_count                                3
balance                                      45
Name: adrian, dtype: object

In [15]:
users.at['adrian', 'balance'] = 35.00
users.loc['adrian']

first_name                               Adrian
last_name                                  Fang
email             adrian.fang@teamtreehouse.com
email_verified                             True
signup_date                             4/28/18
referral_count                                3
balance                                      35
Name: adrian, dtype: object

## Exercise Two - Adding Rows
In this exercise, I have another .csv file in the directory called transactions.csv that I want to change (based on the scenario that I'm trying to emulate).<br> 

First, I've opened the csv in pandas above, but I'll do it again here:

In [53]:
# setting the DataFrame I'm creating to a variable
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'),
                           index_col=0)
transactions.tail()

Unnamed: 0,sender,receiver,amount,sent_date
994.0,king3246,john,25.37,9/25/18
995.0,shernandez,kristen1581,75.77,9/25/18
996.0,leah6255,jholloway,63.62,9/25/18
997.0,pamela,michelle4225,2.54,9/25/18
,,,,


In [54]:
# getting rid of the NaN value up here because it interferes below
transactions.reset_index(inplace=True)

In [55]:
transactions.tail()

Unnamed: 0,index,sender,receiver,amount,sent_date
994,994.0,king3246,john,25.37,9/25/18
995,995.0,shernandez,kristen1581,75.77,9/25/18
996,996.0,leah6255,jholloway,63.62,9/25/18
997,997.0,pamela,michelle4225,2.54,9/25/18
998,,,,,


In [56]:
transactions.drop(index=[998], inplace=True)
transactions.tail()

Unnamed: 0,index,sender,receiver,amount,sent_date
993,993.0,coleman,sarah.evans,36.29,9/25/18
994,994.0,king3246,john,25.37,9/25/18
995,995.0,shernandez,kristen1581,75.77,9/25/18
996,996.0,leah6255,jholloway,63.62,9/25/18
997,997.0,pamela,michelle4225,2.54,9/25/18


In [57]:
# now I want to build a new record (amount as a float, like the rest of the column):
record = dict(sender=np.nan, receiver='adrian', amount=4.99,
              sent_date=datetime.now().date())

### Appending w/ `DataFrame.append`
`DataFrame.append()` adds a new row to the dataset. This method __doesn't change the original dataset, but returns a copy of the DataFrame with the new row(s) appended__. Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell below only runs on older versions.<br>

Something to note: the index for `transactions.csv` is auto-assigned, so I need to pass `ignore_index=True` to generate the index for the new row.<br>

The `.tail()` method is the opposite of `.head()`: it shows the last five rows of the DataFrame instead of the first five.

In [58]:
transactions.append(record, ignore_index=True).tail()
# note that the first time I tried this, I had a typo in 'receiver', so 
#     it added another column indexed with the typo. 

Unnamed: 0,index,sender,receiver,amount,sent_date
994,994.0,king3246,john,25.37,9/25/18
995,995.0,shernandez,kristen1581,75.77,9/25/18
996,996.0,leah6255,jholloway,63.62,9/25/18
997,997.0,pamela,michelle4225,2.54,9/25/18
998,,,adrian,4.99,2019-05-24


If I'm appending multiple rows, it's more efficient to use the `pd.concat()` function. See the documentation for how to do that.
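
As a rough sketch of that approach (the `demo_tx` frame and the two new records are made up for illustration), I can build a second DataFrame from a list of dicts and concatenate it in one call. Like `append`, `pd.concat()` returns a new DataFrame and leaves the original alone, and it also works on current pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the transactions DataFrame (made-up rows)
demo_tx = pd.DataFrame(
    {
        "sender": ["pamela", "coleman"],
        "receiver": ["michelle4225", "sarah.evans"],
        "amount": [2.54, 36.29],
    }
)

# Several new records at once, each as a dict keyed by column name
new_records = [
    {"sender": np.nan, "receiver": "adrian", "amount": 4.99},
    {"sender": "adrian", "receiver": "pamela", "amount": 1.00},
]
new_rows = pd.DataFrame(new_records)

# ignore_index=True renumbers the combined index 0..n-1
combined = pd.concat([demo_tx, new_rows], ignore_index=True)
print(combined.tail())
```

Building one DataFrame and concatenating once avoids copying the whole dataset for every added row.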

### Setting w/ Enlargement
If I assign to a non-existent index key, the DataFrame will be enlarged automatically. <br>

Now, what happens if the index in my .csv isn't autogenerated? There's a popular workaround, which involves taking the last used index and incrementing it.
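
Here is a minimal sketch of that workaround, assuming an integer index (the `demo_tx` frame and the new row are made up for illustration). Assigning to a label that doesn't exist yet through `.loc` adds the row in place.

```python
import pandas as pd

# Hypothetical stand-in for transactions, with an explicit integer index
demo_tx = pd.DataFrame(
    {
        "sender": ["leah6255", "pamela"],
        "receiver": ["jholloway", "michelle4225"],
        "amount": [63.62, 2.54],
    },
    index=[996, 997],
)

# Take the last used index and increment it to get an unused key
next_key = demo_tx.index.max() + 1

# Setting with enlargement: values are listed in column order
demo_tx.loc[next_key] = ["coleman", "adrian", 4.99]
print(demo_tx.tail())
```

Unlike `append` and `pd.concat()`, this modifies the existing DataFrame instead of returning a copy.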

## Exercise Three - Adding Columns
Adding columns is just like adding rows, and missing values will also be set to NaN.

In [60]:
latest_id = transactions.index.max()
# add a new column named notes
transactions.at[latest_id, 'notes'] = 'Adrian called customer support to report a billing error'
transactions.tail()

Unnamed: 0,index,sender,receiver,amount,sent_date,notes
993,993.0,coleman,sarah.evans,36.29,9/25/18,
994,994.0,king3246,john,25.37,9/25/18,
995,995.0,shernandez,kristen1581,75.77,9/25/18,
996,996.0,leah6255,jholloway,63.62,9/25/18,
997,997.0,pamela,michelle4225,2.54,9/25/18,Adrian called customer support to report a bil...


#### NOTE: Deletion and indexing do NOT work on NaN indexes
The NaN index seems to come from an error made when the .csv file was saved. The `reset_index` and `drop` calls in Exercise Two are a workaround, because I couldn't drop the 998th row while it was indexed with NaN.

After `reset_index`, `.tail()` shows __both the new index AND the old one__. The old index is not deleted: as the outputs in this notebook show, it stays in the DataFrame as a regular column named `index`.<br>

At one point `.tail()` also stopped showing only the default five rows and displayed the whole DataFrame instead; I haven't worked out why.
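
An alternative to the `reset_index` workaround is to filter out rows whose index label is NaN with a boolean mask. This is only a sketch on a made-up three-line CSV (the `id` column name and the rows are assumptions), not the real `transactions.csv`.

```python
import io

import pandas as pd

# Made-up CSV whose last line is empty apart from the commas
csv_text = (
    "id,sender,receiver,amount\n"
    "0,stein,smoyer,49.03\n"
    "1,lmoore,kallen,1.94\n"
    ",,,\n"
)
demo_tx = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Keep only the rows whose index label is not NaN
cleaned = demo_tx[demo_tx.index.notna()]
print(cleaned)
```

This keeps the original index in place, so no extra `index` column is created.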

In [61]:
# This adds a new column named large through an expression
transactions['large'] = transactions.amount > 70
transactions.head() 

Unnamed: 0,index,sender,receiver,amount,sent_date,notes,large
0,0.0,stein,smoyer,49.03,1/24/18,,False
1,1.0,holden4580,joshua.henry,34.64,2/6/18,,False
2,2.0,rose.eaton,emily.lewis,62.67,2/15/18,,False
3,3.0,lmoore,kallen,1.94,3/5/18,,False
4,4.0,scott3928,lmoore,27.82,3/10/18,,False


### Renaming Columns
Renaming columns can be done through the `DataFrame.rename` method. I specify the current name(s) as key(s) and the new name(s) as the value(s).

By default this returns a copy, but I can use the `inplace=True` argument to change the existing DataFrame. `inplace` only affects the DataFrame in memory; the .csv saved on my computer stays the same unless I write the DataFrame back out with `to_csv`.

In [62]:
transactions.rename(columns={'large': 'big_sender'}, inplace=True)
transactions.head()

Unnamed: 0,index,sender,receiver,amount,sent_date,notes,big_sender
0,0.0,stein,smoyer,49.03,1/24/18,,False
1,1.0,holden4580,joshua.henry,34.64,2/6/18,,False
2,2.0,rose.eaton,emily.lewis,62.67,2/15/18,,False
3,3.0,lmoore,kallen,1.94,3/5/18,,False
4,4.0,scott3928,lmoore,27.82,3/10/18,,False
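
For comparison, here is a small sketch of the default behaviour without `inplace=True` (the two-row `demo_tx` frame is made up for illustration). `rename` hands back a new DataFrame, and the original keeps its old column name.

```python
import pandas as pd

# Hypothetical two-row stand-in for transactions
demo_tx = pd.DataFrame({"amount": [49.03, 75.77]})
demo_tx["large"] = demo_tx.amount > 70

# Without inplace=True, rename returns a renamed copy
renamed = demo_tx.rename(columns={"large": "big_sender"})

print(list(demo_tx.columns))
print(list(renamed.columns))
```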


## Exercise Four - Deletion 
Columns follow the pattern `varName.drop(columns=['colName'], inplace=True)`, or equivalently `varName.drop(['colName'], axis='columns', inplace=True)`.<br>

Rows follow the same pattern, but are identified by the index label at the front of the row:
`varName.drop(index=[label], inplace=True)` <br>

The label can be a number or a variable that holds the index key.

In [63]:
transactions.drop(['big_sender'], axis='columns', inplace=True)
transactions.head()

Unnamed: 0,index,sender,receiver,amount,sent_date,notes
0,0.0,stein,smoyer,49.03,1/24/18,
1,1.0,holden4580,joshua.henry,34.64,2/6/18,
2,2.0,rose.eaton,emily.lewis,62.67,2/15/18,
3,3.0,lmoore,kallen,1.94,3/5/18,
4,4.0,scott3928,lmoore,27.82,3/10/18,


In [64]:
last_key = transactions.index.max()
transactions.drop(index=[last_key], inplace=True)
transactions.tail()

Unnamed: 0,index,sender,receiver,amount,sent_date,notes
992,992.0,pamela,caleb,25.01,9/25/18,
993,993.0,coleman,sarah.evans,36.29,9/25/18,
994,994.0,king3246,john,25.37,9/25/18,
995,995.0,shernandez,kristen1581,75.77,9/25/18,
996,996.0,leah6255,jholloway,63.62,9/25/18,


### Use Case Exercise

In [68]:
# She exists!
# users[(users.email == "kimberly@yahoo.com")]
# but how many Kimberly's are there?
users[(users.index == "kimberly")]
# answer: only one

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
kimberly,Kimberly,,kimberly@yahoo.com,False,1/6/18,5,54.73


In [69]:
users.at['kimberly', 'last_name'] = "Deal"
users[(users.index == "kimberly")]

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
kimberly,Kimberly,Deal,kimberly@yahoo.com,False,1/6/18,5,54.73


In [70]:
users[(users.index == "jeffrey")]

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
jeffrey,Jeffrey,Stewart,stewart7222@hotmail.com,True,1/2/18,0,40.58


In [84]:
users.at['jefrey', 'first_name'] = "Jefrey"
users.at['jefrey', 'last_name'] = "Stewart"
users.at['jefrey', 'email'] = "stewart7222@hotmail.com"
users.at['jefrey', 'email_verified'] = True
users.at['jefrey', 'signup_date'] = datetime.date(1, 2, 18)
users[(users.index == "jefrey")]

TypeError: descriptor 'date' requires a 'datetime.datetime' object but received a 'int'
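
The error comes from the import at the top of the notebook: `from datetime import datetime` binds the name `datetime` to the class, so `datetime.date` is the instance method that extracts a date from an existing datetime object, not the `date` constructor. A possible fix is sketched below. It assumes that `1/2/18` means January 2, 2018, and that I want the same text format the `signup_date` column already uses.

```python
from datetime import date, datetime

# Reproduce the problem: datetime.date expects a datetime instance, not ints
try:
    datetime.date(1, 2, 18)
    raised_type_error = False
except TypeError:
    raised_type_error = True

# The date class takes (year, month, day); January 2, 2018 is an assumption
signup = date(2018, 1, 2)

# Format it like the existing column values, e.g. month/day/two-digit year
signup_text = f"{signup.month}/{signup.day}/{signup:%y}"
print(signup_text)
```

Since the column holds plain strings, assigning the text `'1/2/18'` directly with `users.at['jefrey', 'signup_date'] = '1/2/18'` would also work.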