# How to Manipulate Text Using Pandas
This might be helpful for my spellchecker problems and for data cleaning with NLTK.

Pandas gives me __vectorized__ string methods through a Series property named `str`, so a single call is applied to every value in the column.
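
As a quick illustration of the `str` property, here is a minimal sketch. The usernames are made up for the example and are not taken from `users.csv`.

```python
import pandas as pd

# Hypothetical usernames, invented for this sketch
names = pd.Series(['$porter', 'GAIL7896', 'kevin'])

# Each .str method runs element-wise and returns a new Series
starts_with_dollar = names.str.startswith('$')
lowered = names.str.lower()

print(starts_with_dollar.tolist())  # [True, False, False]
print(lowered.tolist())             # ['$porter', 'gail7896', 'kevin']
```

The original Series is left unchanged; each call hands back a new one.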

In [11]:
import os
import pandas as pd
from utils import make_chaos

In [12]:
pd.options.display.max_rows = 10
users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
transactions = pd.read_csv(os.path.join('data', 'transactions.csv'), index_col=0)

# deliberately mess up some values so there is something to clean
make_chaos(transactions, 42, ['sender'], lambda val: '$' + val)
make_chaos(transactions, 88, ['receiver'], lambda val: val.upper())

# sanity check
(len(users), len(transactions))

(475, 998)

## Exercise One - Simple Data Cleaning
This is really useful when a lot of values (like common nomenclature) need to change after a decision made further down the line. It strikes me as the kind of data tool that an organization with messy records, like Human Rights Watch, might need.

In this exercise, I'll demonstrate three data cleaning steps:
0. Identifying rows using `DataFrame.colName.str.startswith()` as a Boolean index.
1. Replacing text using `Series.str.replace()`.
2. Finding uppercase values using `Series.str.isupper()` and changing their case using `Series.str.lower()`.

Not referenced in this example, but if I wanted a column's values to have the first letter uppercase and the rest of the letters lowercase, I could do that with the `DataFrame.colName.str.title()` method.
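
A small sketch of `str.title()` next to `str.capitalize()`, using made-up values. The difference shows up when a value contains punctuation, as many of the usernames in this dataset do.

```python
import pandas as pd

# Made-up values for illustration only
names = pd.Series(['aaron', 'ANTHONY', 'adam.saunders'])

# title() uppercases the first letter of every word-like chunk
titled = names.str.title()
# capitalize() uppercases only the very first character
capitalized = names.str.capitalize()

print(titled.tolist())       # ['Aaron', 'Anthony', 'Adam.Saunders']
print(capitalized.tolist())  # ['Aaron', 'Anthony', 'Adam.saunders']
```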

In [13]:
# creates a boolean series that I can use as an index
transactions[transactions.sender.str.startswith('$')]

Unnamed: 0,sender,receiver,amount,sent_date
59,$porter,gail7896,75.16,5/14/18
70,$emily.lewis,kevin,5.49,5/21/18
158,$robinson,rodriguez,8.91,6/25/18
168,$nancy,margaret265,84.15,6/26/18
198,$acook,adam.saunders,9.31,7/4/18
...,...,...,...,...
877,$Apr-82,jacob.davis,50.37,9/21/18
889,$victor,anthony1788,39.06,9/21/18
900,$andersen,corey.ingram,4.81,9/22/18
927,$janet.williams,bsmith,50.15,9/23/18


In [14]:
# replaces every literal "$" in the sender column with an empty string,
#    essentially deleting it; regex=False keeps "$" from being read as
#    the regex end-of-string anchor
transactions.sender = transactions.sender.str.replace('$', '', regex=False)
# verifies that I got them all
len(transactions[transactions.sender.str.startswith('$')])

0
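
One thing to watch: `$` is also the end-of-string anchor in regular expressions, and the default for the `regex` argument of `Series.str.replace()` changed in pandas 2.0. The sketch below spells out the intent both ways, using made-up usernames; it is one way to be explicit, not the only one.

```python
import pandas as pd

# Made-up usernames; the last one has a "$" in the middle
senders = pd.Series(['$porter', 'kevin', 'cash$money'])

# Literal match: every "$" is removed, wherever it appears
all_removed = senders.str.replace('$', '', regex=False)

# Regex match: "^" anchors to the start and "\$" is an escaped dollar sign,
# so only a leading "$" is removed
leading_removed = senders.str.replace(r'^\$', '', regex=True)

print(all_removed.tolist())      # ['porter', 'kevin', 'cashmoney']
print(leading_removed.tolist())  # ['porter', 'kevin', 'cash$money']
```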

In [15]:
# like before, creating a boolean index I can use to filter rows
transactions[transactions.receiver.str.isupper()]

Unnamed: 0,sender,receiver,amount,sent_date
2,rose.eaton,EMILY.LEWIS,62.67,2/15/18
5,francis.hernandez,LMOORE,91.46,3/14/18
14,palmer,CHAD.CHEN,36.27,4/7/18
28,elang,DONNA1922,26.07,4/23/18
34,payne,GRIFFIN4992,85.21,4/26/18
...,...,...,...,...
963,stanley7729,JOSEPH.LOPEZ,50.84,9/25/18
977,martha6969,PATRICIA,87.33,9/25/18
987,alvarado,PAMELA,48.74,9/25/18
990,robert,HEATHER.WADE,86.44,9/25/18


In [16]:
# updates the receiver column of the specific rows that are uppercased
transactions.loc[transactions.receiver.str.isupper(), 'receiver'] = transactions.receiver.str.lower()
# verifies that I got them all
len(transactions[transactions.receiver.str.isupper()])

0
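
The masked assignment works because pandas lines up the right-hand side with the selected rows by index label. A minimal sketch on a made-up frame, which also shows that lowercasing the whole column gives the same result here:

```python
import pandas as pd

# Made-up receivers for illustration
df = pd.DataFrame({'receiver': ['EMILY.LEWIS', 'kevin', 'LMOORE']})

# Boolean mask marking the fully uppercase values
mask = df['receiver'].str.isupper()

# Lowercasing the whole column has the same effect, since the other
# values are already lowercase
whole_column = df['receiver'].str.lower()

# Update only the masked rows; the right-hand side is aligned by index
df.loc[mask, 'receiver'] = df.loc[mask, 'receiver'].str.lower()

print(df['receiver'].tolist())  # ['emily.lewis', 'kevin', 'lmoore']
```

The masked version matters when some rows are mixed case and should be left alone.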

See the [pandas API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) for more on string handling.

## Exercise Two - Grouping
This covers how to get an aggregate view of the transactions for each value in a column, here the receiver.

In [17]:
transactions.dtypes

sender        object
receiver      object
amount       float64
sent_date     object
dtype: object

Grouping is pretty straightforward - I want to group by the receiver, so I use the `DataFrame.groupby()` method:

In [20]:
grouped_by_receiver = transactions.groupby('receiver')

# see what type of object I get back:
type(grouped_by_receiver)

pandas.core.groupby.generic.DataFrameGroupBy

Now that I have my `DataFrameGroupBy` object, I can do a lot of things with it.
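
For instance, a grouped object can hand back the rows behind a single group. A minimal sketch on a made-up frame (the names and amounts are invented):

```python
import pandas as pd

# Made-up transactions for illustration
df = pd.DataFrame({
    'sender': ['kevin', 'nancy', 'victor'],
    'receiver': ['aaron', 'acook', 'aaron'],
    'amount': [1.0, 2.0, 3.0],
})
grouped = df.groupby('receiver')

# .groups maps each group name to the row labels that belong to it
group_names = sorted(grouped.groups)

# get_group() returns the rows for one group as a regular DataFrame
aaron_rows = grouped.get_group('aaron')

print(group_names)                    # ['aaron', 'acook']
print(aaron_rows['amount'].tolist())  # [1.0, 3.0]
```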

In [21]:
# returns a series with the total number of rows in each group
grouped_by_receiver.size()

receiver
Apr-82           2
aaron            6
acook            1
adam.saunders    2
adrian           3
                ..
wilson           2
wking            2
wright3590       4
young            2
zachary.neal     4
Length: 410, dtype: int64

In [22]:
# returns the count of non-missing data points in each column,
#    for each group:
grouped_by_receiver.count()

receiver,sender,amount,sent_date
Apr-82,2,2,2
aaron,6,6,6
acook,1,1,1
adam.saunders,2,2,2
adrian,3,3,3
...,...,...,...
wilson,2,2,2
wking,2,2,2
wright3590,4,4,4
young,2,2,2
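
In the output above `size()` and `count()` agree because no values are missing. A small sketch with a made-up frame that contains one missing amount shows where they diverge:

```python
import pandas as pd

# Made-up data; the second amount is missing on purpose
df = pd.DataFrame({
    'receiver': ['aaron', 'aaron', 'acook'],
    'amount': [10.0, None, 5.0],
})
grouped = df.groupby('receiver')

sizes = grouped.size()    # rows per group, missing values included
counts = grouped.count()  # non-missing values per column, per group

print(sizes['aaron'])                 # 2
print(counts.loc['aaron', 'amount'])  # 1
```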


In [23]:
# the sum method allows me to see each numeric column summed up;
#    numeric_only=True keeps the text columns out of the result
grouped_by_receiver.sum(numeric_only=True)

receiver,amount
Apr-82,88.89
aaron,366.15
acook,94.65
adam.saunders,101.15
adrian,124.36
...,...
wilson,44.39
wking,74.07
wright3590,195.45
young,83.57
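
When I want several aggregates at once, one option is named aggregation on a single column. This is a sketch on made-up data, and the output column names (`n_transactions`, `total`, `average`) are my own choices:

```python
import pandas as pd

# Made-up transactions for illustration
df = pd.DataFrame({
    'receiver': ['aaron', 'aaron', 'acook'],
    'amount': [10.0, 2.5, 5.0],
})

# Each keyword becomes an output column; each value names the aggregation
summary = df.groupby('receiver')['amount'].agg(
    n_transactions='count',
    total='sum',
    average='mean',
)

print(summary.loc['aaron', 'total'])    # 12.5
print(summary.loc['aaron', 'average'])  # 6.25
```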


In [25]:
# now I want to create a new column to see who received the most transactions
users['transaction_count'] = grouped_by_receiver.size()

# now I'll see what missing data I have
len(users[users.transaction_count.isna()])

65
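
The missing values come from index alignment: assigning a Series to a new column matches on index labels, and any user without a matching label gets `NaN`. A minimal sketch with made-up users:

```python
import pandas as pd

# Made-up users, indexed by username like users.csv is here
users_demo = pd.DataFrame(
    {'first_name': ['Aaron', 'Anthony', 'Zoe']},
    index=['aaron', 'acook', 'zoe'],
)

# Counts exist for only two of the three usernames
counts = pd.Series({'aaron': 6, 'acook': 1})

# Assignment aligns on the index; 'zoe' has no match and becomes NaN
users_demo['transaction_count'] = counts

print(users_demo['transaction_count'].isna().sum())  # 1
print(users_demo['transaction_count'].dtype)         # float64
```

The `NaN` is also why the column ends up as a float rather than an integer.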

In [26]:
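# since I don't have a transaction record for everyone, not every user
#     will be in my grouping. So I have to fix the NaNs; assigning the result
#     back is more reliable than calling fillna(inplace=True) on a column
users['transaction_count'] = users['transaction_count'].fillna(0)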
users

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance,transaction_count
aaron,Aaron,Davis,aaron6348@gmail.com,True,8/31/18,6,18.14,6.0
acook,Anthony,Cook,cook@gmail.com,True,5/12/18,2,55.45,1.0
adam.saunders,Adam,Saunders,adam@gmail.com,False,5/29/18,3,72.12,2.0
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,4/28/18,3,30.01,3.0
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,6/16/18,7,25.85,7.0
...,...,...,...,...,...,...,...,...
wilson,Robert,Wilson,robert@yahoo.com,False,5/16/18,5,59.75,2.0
wking,Wanda,King,wanda.king@holt.com,True,6/1/18,2,67.08,2.0
wright3590,Jacqueline,Wright,jacqueline.wright@gonzalez.com,True,2/8/18,6,18.48,4.0
young,Jessica,Young,jessica4028@yahoo.com,True,7/17/18,4,75.39,2.0


In [28]:
# the earlier NaNs left the column as float64, so I'll cast it to an integer type
users.transaction_count = users.transaction_count.astype('int64')
users

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance,transaction_count
aaron,Aaron,Davis,aaron6348@gmail.com,True,8/31/18,6,18.14,6
acook,Anthony,Cook,cook@gmail.com,True,5/12/18,2,55.45,1
adam.saunders,Adam,Saunders,adam@gmail.com,False,5/29/18,3,72.12,2
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,4/28/18,3,30.01,3
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,6/16/18,7,25.85,7
...,...,...,...,...,...,...,...,...
wilson,Robert,Wilson,robert@yahoo.com,False,5/16/18,5,59.75,2
wking,Wanda,King,wanda.king@holt.com,True,6/1/18,2,67.08,2
wright3590,Jacqueline,Wright,jacqueline.wright@gonzalez.com,True,2/8/18,6,18.48,4
young,Jessica,Young,jessica4028@yahoo.com,True,7/17/18,4,75.39,2
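
The assign, fill, cast sequence can also be collapsed into one step with `Series.reindex()` and its `fill_value` argument. This is an alternative sketch on made-up data, not what the cells above do:

```python
import pandas as pd

# Made-up users and transactions for illustration
users_demo = pd.DataFrame(
    {'first_name': ['Aaron', 'Anthony', 'Zoe']},
    index=['aaron', 'acook', 'zoe'],
)
transactions_demo = pd.DataFrame({'receiver': ['aaron', 'aaron', 'acook']})

counts = transactions_demo.groupby('receiver').size()

# reindex() lines the counts up with every user and fills the gaps with 0,
# so no NaN appears and the column stays an integer type
users_demo['transaction_count'] = counts.reindex(users_demo.index, fill_value=0)

print(users_demo['transaction_count'].tolist())  # [2, 1, 0]
```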


In [30]:
# now I'll sort the users by my new column, highest first, breaking ties by first name
users.sort_values(
    ['transaction_count', 'first_name'],
    ascending=[False, True],
    inplace=True
)
# now I want to see the top 10 receivers, showing only the columns I want
users.loc[:, ['first_name', 'last_name', 'email', 'transaction_count']].head(10)

Unnamed: 0,first_name,last_name,email,transaction_count
scott3928,Scott,,scott@yahoo.com,9
sfinley,Samuel,Finley,samuel@gmail.com,8
adrian.blair,Adrian,Blair,adrian9335@gmail.com,7
hdeleon,Hannah,Deleon,hannah@yahoo.com,7
miranda6426,Miranda,Rogers,miranda.rogers@gmail.com,7
aaron,Aaron,Davis,aaron6348@gmail.com,6
corey,Corey,Fuller,fuller8100@yahoo.com,6
heather,Heather,Ray,hray@yahoo.com,6
jennifer.hebert,Jennifer,Hebert,jennifer.hebert@yahoo.com,6
edwards,Michael,Edwards,edwards5456@gmail.com,6
