# Working with Strings

### Exercise 1 strings
---
The pandas library has a set of string functions similar to those available in Python generally. Because we often want to perform an operation on a whole series of data values in a dataframe, we can use the pandas string functions to do this.

Let's use the data set 'Housing in London' at 'https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/housing_in_london_yearly_variables.csv'

Read the dataset into a pandas dataframe and inspect the data. The date in this dataset is a string. If we want to filter for a particular year, we will need to extract the first four characters as a substring. We can create a new column called **year**, which contains just the year, stored as a number.

The date is written in the format yyyy-mm-dd. We can split the date around the '-' and then use the first component, converting it to an integer.

Reference:  

Series.str.split() *to split a column's strings into components*    
Series.str.get() *to get one of the components after the split*  

You can **daisychain** these together:   

`Series.str.split().str.get()`
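As a minimal sketch of how the chain behaves, here it is on a small made-up Series (the `dates` values below are illustrative and are not read from the housing data set):

```python
import pandas as pd

# Illustrative dates in the same yyyy-mm-dd format (made-up values)
dates = pd.Series(['1999-12-01', '2000-12-01', '2001-12-01'])

# split('-') turns each string into a list such as ['1999', '12', '01'];
# get(0) then picks the first item of each list
years = dates.str.split('-').str.get(0)
print(years.tolist())
```

The result is still a Series of strings, which is why its dtype is reported as object.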

Have a go

**Test output**:  

```
0       1999
1       1999
2       1999
3       1999
4       1999
        ... 
1066    2019
1067    2019
1068    2019
1069    2019
1070    2019
Name: year, Length: 1071, dtype: object
```

In [1]:
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/housing_in_london_yearly_variables.csv'
housing_in_london = pd.read_csv(url)
housing_in_london
housing_in_london['date'].str.split('-').str.get(0)

0       1999
1       1999
2       1999
3       1999
4       1999
        ... 
1066    2019
1067    2019
1068    2019
1069    2019
1070    2019
Name: date, Length: 1071, dtype: object

### Exercise 2
---

In Exercise 1 you extracted the year, but its dtype is 'object' (it is still a string). You can convert it to an integer by adding `.astype(int)` to the daisychain.
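A small sketch of the conversion on made-up values (the `year_strings` Series is illustrative only). The exact integer width that pandas reports, int64 or int32, can depend on your platform and library versions, so either is acceptable here:

```python
import pandas as pd

# Years stored as strings, as they are after split() and get()
year_strings = pd.Series(['1999', '2000', '2001'])

# astype(int) converts every value to an integer
year_numbers = year_strings.astype(int)
print(year_numbers.dtype)

# Numeric comparisons now work, for example filtering for years after 1999
print(year_numbers[year_numbers > 1999].tolist())
```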

**Test output**:  

```
...
Name: year, Length: 1071, dtype: int64
```



In [2]:
housing_in_london['date'].str.split('-').str.get(0).astype(int)

0       1999
1       1999
2       1999
3       1999
4       1999
        ... 
1066    2019
1067    2019
1068    2019
1069    2019
1070    2019
Name: date, Length: 1071, dtype: int32

### Exercise 3
---

All the areas in the data set are in lower case. To prepare the data for reporting, you may want to capitalise them. Use `.str.title()` to do this.
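A quick sketch on two made-up values shows what to expect. Note that `title()` capitalises every word, including short words such as 'of' and 'and':

```python
import pandas as pd

# Two lower-case area names (illustrative values)
areas = pd.Series(['city of london', 'barking and dagenham'])

# title() capitalises the first letter of each word
titled = areas.str.title()
print(titled.tolist())
```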

In [3]:
housing_in_london['area'].str.title()

0             City Of London
1       Barking And Dagenham
2                     Barnet
3                     Bexley
4                      Brent
                ...         
1066           Great Britain
1067       England And Wales
1068        Northern Ireland
1069                Scotland
1070                   Wales
Name: area, Length: 1071, dtype: object

### Exercise 4 - Filter all areas to find all with 'and' in the name
---

Use `str.contains()` in a search (e.g. `df[df['area'].str.contains(...)]`) to filter for all areas with 'and' in the name.
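The sketch below, on a made-up dataframe, shows why the spaces around the word matter: searching for 'and' alone also matches names such as 'scotland', whereas ' and ' only matches the separate word.

```python
import pandas as pd

# Small illustrative dataframe (values are made up)
df = pd.DataFrame({
    'area': ['barking and dagenham', 'barnet', 'england and wales', 'scotland'],
    'value': [1, 2, 3, 4],
})

# str.contains() returns a True/False Series that can be used as a filter
mask = df['area'].str.contains(' and ')
filtered = df[mask]
print(filtered['area'].tolist())

# Without the spaces, 'scotland' is matched as well
loose_matches = df['area'].str.contains('and').sum()
print(loose_matches)
```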

**Test output**:  
105 rows × 13 columns


In [4]:
housing_in_london[housing_in_london['area'].str.contains(' and ')]


Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1
12,E09000013,hammersmith and fulham,1999-12-01,25000.0,,28555,7,160634.0,,,,1
19,E09000020,kensington and chelsea,1999-12-01,20646.0,,28074,13,147678.0,,,,1
35,E12000003,yorkshire and the humber,1999-12-01,16527.0,,18977,7,4956325.0,,,,0
47,K04000001,england and wales,1999-12-01,17974.0,,21549,,51933471.0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1021,E09000002,barking and dagenham,2019-12-01,28738.0,,32010,,,,,,1
1032,E09000013,hammersmith and fulham,2019-12-01,37990.0,,48362,,,,,,1
1039,E09000020,kensington and chelsea,2019-12-01,33000.0,,41741,,,,,,1
1055,E12000003,yorkshire and the humber,2019-12-01,27835.0,,32653,,,,,,0


### Exercise 4
---

Filter the data for all areas starting with 'ba'  

**Test output**:  
42 rows (barking and dagenham, barnet)

In [5]:
housing_in_london[housing_in_london['area'].str.startswith('ba')]

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3.0,162444.0,,,,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8.0,313469.0,,,,1
52,E09000002,barking and dagenham,2000-12-01,22618.0,,24696,4.0,163893.0,57000.0,,,1
53,E09000003,barnet,2000-12-01,21761.0,,25755,8.0,315784.0,138000.0,,,1
103,E09000002,barking and dagenham,2001-12-01,22323.0,,26050,3.0,165654.0,54000.0,3780.0,68298.0,1
104,E09000003,barnet,2001-12-01,20916.0,,26068,8.0,319481.0,138000.0,8675.0,130515.0,1
154,E09000002,barking and dagenham,2002-12-01,24813.0,,26653,3.0,166357.0,52000.0,3780.0,68526.0,1
155,E09000003,barnet,2002-12-01,23112.0,,30210,13.0,320552.0,135000.0,8675.0,130801.0,1
205,E09000002,barking and dagenham,2003-12-01,25358.0,,27792,5.0,166210.0,55000.0,3780.0,68837.0,1
206,E09000003,barnet,2003-12-01,23828.0,,30518,16.0,321802.0,138000.0,8675.0,131883.0,1


### Exercise 5
---
Filter the data for all areas ending with 'ham', for the year 2000
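As a sketch of combining two string conditions, the made-up dataframe below is filtered with `endswith()` on one column and `startswith()` on another. Each condition produces a True/False Series, and `&` keeps only the rows where both are True:

```python
import pandas as pd

# Illustrative rows only, in the same shape as the housing data
df = pd.DataFrame({
    'area': ['newham', 'newham', 'barnet', 'hammersmith and fulham'],
    'date': ['1999-12-01', '2000-12-01', '2000-12-01', '2000-12-01'],
})

ends_with_ham = df['area'].str.endswith('ham')
in_2000 = df['date'].str.startswith('2000')

# Both conditions must hold for a row to be kept
result = df[ends_with_ham & in_2000]
print(result)
```

When the conditions are written inline rather than stored in variables, each one needs its own pair of parentheses around it.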

**Test output**:  
4 rows (barking and dagenham, hammersmith and fulham, lewisham, newham)  

In [6]:
housing_in_london[(housing_in_london['area'].str.endswith('ham')) & (housing_in_london['date'].str.contains('2000'))]

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
52,E09000002,barking and dagenham,2000-12-01,22618.0,,24696,4,163893.0,57000.0,,,1
63,E09000013,hammersmith and fulham,2000-12-01,25264.0,,28742,8,164393.0,120000.0,,,1
73,E09000023,lewisham,2000-12-01,22357.0,,22659,5,252106.0,76000.0,,,1
75,E09000025,newham,2000-12-01,19437.0,,21609,3,245463.0,79000.0,,,1


### Exercise 6 - new data set
---

Use the data set here:  https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true

Read the data from the sheet 'Skill Migration'  

Inspect the data, then create a new dataframe with the following changes:

1.  Remove the word 'Skills' from the 'skill_group_category' column  
2.  Convert country_code to uppercase  
3.  Filter for regions containing 'Asia'
4.  Remove the skill_group_id and the wb_income columns

**Test output**:  
9969 rows × 10 columns

In [25]:
import pandas as pd
url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true'
Skill_Migration = pd.read_excel(url,sheet_name = 'Skill Migration')
Skill_Migration.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17617 entries, 0 to 17616
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   country_code          17617 non-null  object 
 1   country_name          17617 non-null  object 
 2   wb_income             17617 non-null  object 
 3   wb_region             17617 non-null  object 
 4   skill_group_id        17617 non-null  int64  
 5   skill_group_category  17617 non-null  object 
 6   skill_group_name      17617 non-null  object 
 7   net_per_10K_2015      17617 non-null  float64
 8   net_per_10K_2016      17617 non-null  float64
 9   net_per_10K_2017      17617 non-null  float64
 10  net_per_10K_2018      17617 non-null  float64
 11  net_per_10K_2019      17617 non-null  float64
dtypes: float64(5), int64(1), object(6)
memory usage: 1.6+ MB


In [24]:
# Work on a copy so that Skill_Migration itself is left unchanged
asia_migration = Skill_Migration.copy()
# Overwrite the existing column (lower-case name) rather than adding a new one
asia_migration['skill_group_category'] = asia_migration['skill_group_category'].str.replace('Skills', '').str.strip()
asia_migration['country_code'] = asia_migration['country_code'].str.upper()
asia_migration = asia_migration.drop(['skill_group_id', 'wb_income'], axis=1)
asia_migration = asia_migration[asia_migration['wb_region'].str.contains('Asia')]
asia_migration


Unnamed: 0,country_code,country_name,wb_region,skill_group_category,skill_group_name,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,AF,Afghanistan,South Asia,Tech,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,AF,Afghanistan,South Asia,Business,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,AF,Afghanistan,South Asia,Specialized Industry,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,AF,Afghanistan,South Asia,Tech,Software Testing,-957.50,-828.54,-964.73,-406.50,-739.51
4,AF,Afghanistan,South Asia,Specialized Industry,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64
...,...,...,...,...,...,...,...,...,...,...
17371,VN,Vietnam,East Asia & Pacific,Specialized Industry,Air Force,658.89,291.89,-271.00,219.20,166.99
17372,VN,Vietnam,East Asia & Pacific,Specialized Industry,Cooking,83.81,236.02,-42.85,238.90,180.90
17373,VN,Vietnam,East Asia & Pacific,Business,Sales Leads,511.16,98.13,-355.91,133.89,181.47
17374,VN,Vietnam,East Asia & Pacific,Disruptive Tech,Aerospace Engineering,410.02,42.71,-222.09,138.03,256.55


### Exercise 7 - a data cleaning function

Create a function called **clean_data(df)** that performs the same actions as in Exercise 6 and *returns* the cleaned data set.

Format:

```
def clean_data(df):
  # code to clean the data as in Exercise 6, then return the final dataframe

cleaned_data = clean_data(migration_data)
```
**Test output:**  
9969 rows × 10 columns

In [26]:
def clean_data(df):
    # Work on a copy so that the caller's dataframe is not modified
    df = df.copy()
    # Overwrite the existing column (lower-case name) rather than adding a new one
    df['skill_group_category'] = df['skill_group_category'].str.replace('Skills', '').str.strip()
    df['country_code'] = df['country_code'].str.upper()
    df = df.drop(['skill_group_id', 'wb_income'], axis=1)
    return df[df['wb_region'].str.contains('Asia')]

clean_data(Skill_Migration)


Unnamed: 0,country_code,country_name,wb_region,skill_group_category,skill_group_name,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,AF,Afghanistan,South Asia,Tech,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,AF,Afghanistan,South Asia,Business,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,AF,Afghanistan,South Asia,Specialized Industry,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,AF,Afghanistan,South Asia,Tech,Software Testing,-957.50,-828.54,-964.73,-406.50,-739.51
4,AF,Afghanistan,South Asia,Specialized Industry,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64
...,...,...,...,...,...,...,...,...,...,...
17371,VN,Vietnam,East Asia & Pacific,Specialized Industry,Air Force,658.89,291.89,-271.00,219.20,166.99
17372,VN,Vietnam,East Asia & Pacific,Specialized Industry,Cooking,83.81,236.02,-42.85,238.90,180.90
17373,VN,Vietnam,East Asia & Pacific,Business,Sales Leads,511.16,98.13,-355.91,133.89,181.47
17374,VN,Vietnam,East Asia & Pacific,Disruptive Tech,Aerospace Engineering,410.02,42.71,-222.09,138.03,256.55


### Exercise 8
---

Write a function that will rename the net_per_10K_year columns to be just the year and will replace the 'z' (as in 'specialized') with 's' to Anglicise the spelling. The function should return the cleaned data.  

Hint:  for column names, you can get the name as a string, then use find() to see if 'net_per_10K_' is in the string (the result will not be -1 if it is there), then you can replace the column name
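The hint can be sketched on a tiny made-up dataframe as follows. The `toy` data and the variable names are illustrative only; the column names mimic those in the 'Skill Migration' sheet:

```python
import pandas as pd

# One made-up row with two of the year columns
toy = pd.DataFrame({
    'skill_group_category': ['Specialized Industry Skills'],
    'net_per_10K_2015': [1.5],
    'net_per_10K_2016': [2.5],
})

prefix = 'net_per_10K_'
new_names = {}
for name in toy.columns:
    # find() returns -1 when the prefix is not in the column name
    if name.find(prefix) != -1:
        new_names[name] = name.replace(prefix, '')

# rename() takes a dictionary mapping old column names to new ones
renamed = toy.rename(columns=new_names)

# Replace 'z' with 's' in the category values
renamed['skill_group_category'] = renamed['skill_group_category'].str.replace('z', 's', regex=False)
print(renamed)
```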

**Test output**:  
17617 rows × 12 columns, with z replaced by s in Specialized  
Column names: country_code	country_name	wb_income	wb_region	skill_group_id	skill_group_category	skill_group_name	2015	2016	2017	2018	2019

In [None]:
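def z_replace(df):
    df = df.rename(columns=lambda name: name.replace('net_per_10K_', ''))
    df['skill_group_category'] = df['skill_group_category'].str.replace('z', 's')
    return df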

### Exercise 9
---

Read the 'Country Migration' sheet.

Write a function that will:  
*  convert the country codes to upper case  
*  drop the lat and long columns for both base and target  
*  rename the net_per_10K_year columns to year only  
*  filter for rows where base_country_wb_region contains 'Africa' and target_country_wb_region contains 'Asia'  
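
One possible shape for such a function is sketched below on made-up rows. The function name `clean_country_migration` is hypothetical, and the four latitude/longitude column names are assumptions, so check the real names when you inspect the sheet:

```python
import pandas as pd

def clean_country_migration(df):
    """Sketch of the four cleaning steps, applied to a copy of df."""
    df = df.copy()
    # Upper-case both country code columns
    for col in ['base_country_code', 'target_country_code']:
        df[col] = df[col].str.upper()
    # Assumed lat/long column names; errors='ignore' skips any that are absent
    df = df.drop(columns=['base_lat', 'base_long', 'target_lat', 'target_long'],
                 errors='ignore')
    # Strip the prefix so that 'net_per_10K_2015' becomes '2015'
    df = df.rename(columns=lambda name: name.replace('net_per_10K_', ''))
    # Keep rows whose base region mentions Africa and target region mentions Asia
    from_africa = df['base_country_wb_region'].str.contains('Africa')
    to_asia = df['target_country_wb_region'].str.contains('Asia')
    return df[from_africa & to_asia]

# Made-up rows for illustration only
toy = pd.DataFrame({
    'base_country_code': ['ae', 'ae', 'fr'],
    'base_country_wb_region': ['Middle East & North Africa',
                               'Middle East & North Africa',
                               'Europe & Central Asia'],
    'target_country_code': ['af', 'us', 'af'],
    'target_country_wb_region': ['South Asia', 'North America', 'South Asia'],
    'base_lat': [23.4, 23.4, 46.2],
    'net_per_10K_2015': [0.19, 0.5, 0.1],
})

result = clean_country_migration(toy)
print(result)
```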

**Test output**:  
```
base_country_code	base_country_name	base_country_wb_income	base_country_wb_region	target_country_code	target_country_name	target_country_wb_income	target_country_wb_region	2015	2016	2017	2018	2019
0	AE	United Arab Emirates	High Income	Middle East & North Africa	AF	Afghanistan	Low Income	South Asia	0.19	0.16	0.11	-0.05	-0.02
4	AE	United Arab Emirates	High Income	Middle East & North Africa	AM	Armenia	Upper Middle Income	Europe & Central Asia	0.10	0.05	0.03	-0.01	0.02
5	AE	United Arab Emirates	High Income	Middle East & North Africa	AU	Australia	High Income	East Asia & Pacific	-1.06	-3.31	-4.01	-4.58	-4.09
6	AE	United Arab Emirates	High Income	Middle East & North Africa	AT	Austria	High Income	Europe & Central Asia	0.11	-0.08	-0.07	-0.05	-0.16
7	AE	United Arab Emirates	High Income	Middle East & North Africa	AZ	Azerbaijan	Upper Middle Income	Europe & Central Asia	0.24	0.25	0.10	0.05	0.04
...	...	...	...	...	...	...	...	...	...	...	...	...	...
4132	ZM	Zambia	Lower Middle Income	Sub-Saharan Africa	GB	United Kingdom	High Income	Europe & Central Asia	43.27	27.60	7.88	6.90	3.68
4135	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	AU	Australia	High Income	East Asia & Pacific	-1.31	-2.33	-2.10	-2.08	-1.84
4138	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	IS	Iceland	High Income	Europe & Central Asia	8.52	6.22	2.35	1.81	0.97
4142	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	NO	Norway	High Income	Europe & Central Asia	2.88	6.46	2.10	0.33	-0.13
4145	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	GB	United Kingdom	High Income	Europe & Central Asia	3.91	4.66	0.74	-0.66	-1.97
478 rows × 13 columns
```



### Exercise 10
---

Read the data from file 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv'.

Write a function that will return a new dataframe with just the married women listed, surname only.
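
A minimal sketch of one approach, on a few rows written in the 'Surname, Title. Forenames' style that the names in this file follow. It assumes that a married woman can be recognised by the title 'Mrs.' in the Name column, and the function name `married_women_surnames` is hypothetical:

```python
import pandas as pd

def married_women_surnames(df):
    # Assumption: the title 'Mrs.' identifies a married woman
    married = df[df['Name'].str.contains('Mrs.', regex=False)].copy()
    # The surname is everything before the first comma
    married['Name'] = married['Name'].str.split(',').str.get(0)
    return married

# Illustrative rows only
toy = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
             'Heikkinen, Miss. Laina'],
    'Sex': ['male', 'female', 'female'],
})

result = married_women_surnames(toy)
print(result)
```

Passing `regex=False` makes `str.contains()` treat the full stop in 'Mrs.' as a literal character rather than as a pattern that matches any character.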

**Test output**:  
```
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings	female	38.0	1	0	PC 17599	71.2833	C85	C
3	4	1	1	Futrelle	female	35.0	1	0	113803	53.1000	C123	S
8	9	1	3	Johnson	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser	female	14.0	1	0	237736	30.0708	NaN	C
15	16	1	2	Hewlett	female	55.0	0	0	248706	16.0000	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
871	872	1	1	Beckwith	female	47.0	1	1	11751	52.5542	D35	S
874	875	1	2	Abelson	female	28.0	1	0	P/PP 3381	24.0000	NaN	C
879	880	1	1	Potter	female	56.0	0	1	11767	83.1583	C50	C
880	881	1	2	Shelley	female	25.0	0	1	230433	26.0000	NaN	S
885	886	0	3	Rice	female	39.0	0	5	382652	29.1250	NaN	Q
129 rows × 12 columns
```



