
# Coercing Strings in Pandas

### Introduction

So far, we have seen how to coerce data by changing the datatype with `astype`, by removing missing data, and by transforming values with the `map` function.  In this lesson, we will see some techniques for working with string data in pandas.

### Working with Strings

Let's use pandas to scrape some data from ESPN about the roster of the Houston Rockets.

In [0]:
import pandas as pd
hou_dfs = pd.read_html("https://www.espn.com/nba/team/roster/_/name/hou")

In [0]:
hou_df = hou_dfs[-1].loc[:, 'Name':]

In [0]:
hou_df[:3]

|   | Name | POS | Age | HT | WT | College | Salary |
|---|------|-----|-----|----|----|---------|--------|
| 0 | Bruno Caboclo5 | SF | 24 | 6' 9" | 218 lbs | -- | $1,845,301 |
| 1 | DeMarre Carroll9 | SF | 33 | 6' 6" | 215 lbs | Missouri | $512,721 |
| 2 | Tyson Chandler19 | C | 37 | 7' 0" | 235 lbs | -- | $1,620,564 |


Now let's look to see what datatypes we have.

In [0]:
hou_df.dtypes

Name       object
POS        object
Age         int64
HT         object
WT         object
College    object
Salary     object
dtype: object

In the data above, every column except `Age` is of type `object`, but several of them (height, weight, and salary) could be converted into numbers.

In addition, each player's name has a number at the end of it, likely the player's jersey number.

### Cleaning Data

Let's start by trying to turn the weight column into a number.  We cannot just use `pd.to_numeric` here, because of the ` lbs` characters at the end of each value.  Luckily for us, the string methods that pandas exposes through the `.str` accessor can quickly help us clean up this data.

We can access the string methods by taking any series of type `object` that holds strings and appending `.str`.
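To see why the direct conversion fails, here is a minimal sketch. It does not scrape ESPN; the `wt` series is a hand-built stand-in for `hou_df['WT']`, using the first three weights shown above.

```python
import pandas as pd

# Stand-in for hou_df['WT'] (assumption: built by hand from the rows shown above).
wt = pd.Series(["218 lbs", "215 lbs", "235 lbs"], name="WT")

# pd.to_numeric raises a ValueError when it meets a string it cannot parse.
try:
    pd.to_numeric(wt)
    parse_failed = False
except ValueError:
    parse_failed = True

# errors="coerce" avoids the exception, but each unparseable value becomes NaN,
# so the weights would be lost rather than converted.
coerced = pd.to_numeric(wt, errors="coerce")

print(parse_failed)
print(coerced)
```

Because coercion would discard every value here, the text has to be cleaned before converting.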

In [0]:
hou_df['WT'].str

<pandas.core.strings.StringMethods at 0x121ea8450>

From here, we can simply use tab completion to see a list of methods, or we can browse the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html).
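Outside an interactive session, where tab completion is not available, one way to get a similar list is Python's built-in `dir`. This is only a sketch on a hand-built stand-in series, and the exact set of methods depends on the installed pandas version.

```python
import pandas as pd

# Stand-in for hou_df['WT'] (assumption: built by hand for illustration).
wt = pd.Series(["218 lbs", "215 lbs"], name="WT")

# Collect the public names available on the .str accessor.
public_methods = [name for name in dir(wt.str) if not name.startswith("_")]

print(public_methods)
```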

Now there are a couple of ways that we can clean up this particular series.  The first is to slice off the last three characters (`lbs`), as those are non-numeric.  Note that this leaves the separating space at the end of each value.

In [0]:
hou_df['WT'][:2]

0    218 lbs
1    215 lbs
Name: WT, dtype: object

In [0]:
all_but_last_three = hou_df['WT'].str[:-3]
all_but_last_three[:4]

0    218 
1    215 
2    235 
3    180 
Name: WT, dtype: object
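The sliced values are still strings, and each one keeps the space that preceded `lbs`. A minimal sketch of finishing the conversion, again on a hand-built stand-in for `hou_df['WT']`, could strip that whitespace and then call `pd.to_numeric`:

```python
import pandas as pd

# Stand-in for hou_df['WT'] (assumption: built by hand from the rows shown above).
wt = pd.Series(["218 lbs", "215 lbs", "235 lbs", "180 lbs"], name="WT")

sliced = wt.str[:-3]             # drops "lbs" but keeps the space before it
stripped = sliced.str.strip()    # removes the leftover whitespace
wt_numbers = pd.to_numeric(stripped)

print(wt_numbers)
```

Slicing with `[:-4]` would remove the space as well, at the cost of relying on every value ending in exactly four unwanted characters.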

Or, we could use `replace` to swap ` lbs` for an empty string.

In [0]:
replaced = hou_df['WT'].str.replace(' lbs', '')
replaced[:2]

0    218
1    215
Name: WT, dtype: object
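Once the suffix is gone, the series converts cleanly. The sketch below uses a hand-built stand-in for `hou_df['WT']` and passes `regex=False`, an optional argument added here so that the pattern is treated as literal text regardless of the pandas version's default.

```python
import pandas as pd

# Stand-in for hou_df['WT'] (assumption: built by hand from the rows shown above).
wt = pd.Series(["218 lbs", "215 lbs", "235 lbs"], name="WT")

# Treat " lbs" as a literal substring rather than a regular expression.
replaced = wt.str.replace(" lbs", "", regex=False)

# With only digits left, the conversion to numbers succeeds.
wt_numbers = pd.to_numeric(replaced)

print(wt_numbers)
```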

Another useful method is `split`.  Let's try using that, splitting on whitespace (the default).

In [0]:
split_wt = hou_df['WT'].str.split()
split_wt[:3]

0    [218, lbs]
1    [215, lbs]
2    [235, lbs]
Name: WT, dtype: object

And from there, we can use `map` to select the first element from each list.

In [0]:
wts = split_wt.map(lambda x: x[0])
wts[:3]

0    218
1    215
2    235
Name: WT, dtype: object
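As an alternative to `map`, the `.str` accessor can also index into each list, and `split` accepts `expand=True` to return the pieces as separate columns. The following is a sketch on a hand-built stand-in for `hou_df['WT']`:

```python
import pandas as pd

# Stand-in for hou_df['WT'] (assumption: built by hand from the rows shown above).
wt = pd.Series(["218 lbs", "215 lbs", "235 lbs"], name="WT")

split_wt = wt.str.split()

# Two equivalent ways to take the first piece of each list.
first_via_map = split_wt.map(lambda parts: parts[0])
first_via_str = split_wt.str[0]

# expand=True returns a DataFrame with one column per piece.
pieces = wt.str.split(expand=True)

print(first_via_str)
print(pieces)
```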

### Summary

In this lesson we saw that we can reach the string methods by taking a series of type `object` and appending `.str`.  From there, we saw that we can slice our strings, use the `split` method, or use the `replace` method to modify our data.

This will help us convert data to numbers in the future.
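As a sketch of where this leads, the same methods could be applied to the salary and height columns. The series below are hand-built stand-ins for `hou_df['Salary']` and `hou_df['HT']`, and expressing height in total inches is an assumption made for illustration, not something the lesson prescribes.

```python
import pandas as pd

# Stand-ins for two columns of hou_df (assumption: built by hand from the rows shown above).
salary = pd.Series(["$1,845,301", "$512,721", "$1,620,564"], name="Salary")
height = pd.Series(["6' 9\"", "6' 6\"", "7' 0\""], name="HT")

# Remove the dollar sign and the thousands separators, then convert.
salary_numbers = pd.to_numeric(
    salary.str.replace("$", "", regex=False).str.replace(",", "", regex=False)
)

# Capture feet and inches as two columns, convert each, and combine into inches.
feet_inches = height.str.extract(r"(\d+)'\s*(\d+)").apply(pd.to_numeric)
height_inches = feet_inches[0] * 12 + feet_inches[1]

print(salary_numbers)
print(height_inches)
```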

### Resources

[pandas string methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)

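As a further option, `str.extract` with a regular expression captures the leading digits of each weight.

In [0]: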
hou_df['WT'].str.extract(r'(\d*)', expand=False)

0     218
1     215
2     235
3     180
4     209
5     200
6     215
7     235
8     220
9     250
10    220
11    207
12    195
13    200
14    215
15    245
16    200
Name: WT, dtype: object
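The same `str.extract` approach could plausibly separate the trailing number from each player's name. This is only a sketch: the `names` series is a hand-built stand-in for `hou_df['Name']`, and the group names `player` and `number` are hypothetical labels chosen for illustration.

```python
import pandas as pd

# Stand-in for hou_df['Name'] (assumption: built by hand from the rows shown above).
names = pd.Series(["Bruno Caboclo5", "DeMarre Carroll9", "Tyson Chandler19"], name="Name")

# Named groups become column names: everything before the trailing digits,
# then the trailing digits themselves.
name_parts = names.str.extract(r"^(?P<player>.*?)(?P<number>\d+)$")
name_parts["number"] = pd.to_numeric(name_parts["number"])

print(name_parts)
```

A name with no trailing digits would not match this pattern and would produce missing values in both columns.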