# Working with Strings

### Exercise 1 strings
---
The pandas library has a set of string functions similar to those available in Python generally.  Because we often want to perform an operation on a whole series of data values in a dataframe, pandas makes these functions available on a Series through its `.str` accessor.

Let's use the data set 'Housing in London' at 'https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/housing_in_london_yearly_variables.csv'

Read the dataset into a pandas dataframe and inspect the data.  The date in this dataset is a string.  If we want to filter for a particular year, we will need to extract the first four characters as a substring.  We can create a new column called **year**, which contains just the year.

The date is written in the format yyyy-mm-dd.  We can split the date around the '-' and then use the first component.  The result is still a string (dtype 'object'); Exercise 2 converts it to an integer.

Reference:  

Series.str.split() *to split a column's strings into components*    
Series.str.get() *to get one of the components after the split*  

You can **daisychain** these together:   

`Series.str.split().str.get()`
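As a minimal sketch of the chain, here it is on a small hand-made Series. The sample dates are made up for illustration; only the yyyy-mm-dd format is taken from the dataset.

```python
import pandas as pd

# Hypothetical sample dates in the same yyyy-mm-dd format as the dataset
dates = pd.Series(["1999-12-01", "2000-12-01", "2019-12-01"])

# split() turns each string into a list of parts; get(0) picks the first part
years = dates.str.split("-").str.get(0)

print(years.tolist())  # ['1999', '2000', '2019']
```

Note that the years come back as strings, not numbers.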

Have a go

**Test output**:  

```
0       1999
1       1999
2       1999
3       1999
4       1999
        ... 
1066    2019
1067    2019
1068    2019
1069    2019
1070    2019
Name: year, Length: 1071, dtype: object
```

In [134]:
import pandas as pd
url = 'https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/housing_in_london_yearly_variables.csv'

housing_london = pd.read_csv(url)

housing_london["year"]= housing_london["date"].str.split("-").str.get(0)

display(housing_london["year"])

0       1999
1       1999
2       1999
3       1999
4       1999
        ... 
1066    2019
1067    2019
1068    2019
1069    2019
1070    2019
Name: year, Length: 1071, dtype: object

### Exercise 2
---

In exercise 1 you extracted the year, but its dtype is 'object' (it is still a string).  You can convert it to an integer by adding .astype(int) to the daisychain.
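A small sketch of the conversion on made-up values. The exact integer dtype reported (int32 or int64) can depend on your platform and library versions, so the check below only tests that the result is some integer type.

```python
import pandas as pd

# Hypothetical year strings, as produced by the split in exercise 1
year_strings = pd.Series(["1999", "2000", "2019"])

# astype(int) converts every value in the Series to an integer
year_numbers = year_strings.astype(int)

print(year_numbers.tolist())                          # [1999, 2000, 2019]
print(pd.api.types.is_integer_dtype(year_numbers))    # True
```

Once the column is numeric, comparisons such as `year_numbers >= 2000` behave as number comparisons rather than text comparisons.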

**Test output**:  

```
...
Name: year, Length: 1071, dtype: int64
```



In [77]:
housing_london["year"]= housing_london["date"].str.split("-").str.get(0).astype(int)
display (housing_london['year'])
display (housing_london.head(5))

0       1999
1       1999
2       1999
3       1999
4       1999
        ... 
1066    2019
1067    2019
1068    2019
1069    2019
1070    2019
Name: year, Length: 1071, dtype: int32

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag,year
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,,1,1999
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1,1999
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,1,1999
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,1,1999
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,,1,1999


### Exercise 3
---

All the areas in the data set are in lower case.  To prepare the data for reporting, you may want to capitalise.  Use .str.title() to do this.
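For illustration, this is what .str.title() does to a few area names written out by hand (the names match ones that appear in the dataset):

```python
import pandas as pd

areas = pd.Series(["city of london", "barking and dagenham", "barnet"])

# title() capitalises the first letter of every word, including 'of' and 'and'
titled = areas.str.title()

print(titled.tolist())  # ['City Of London', 'Barking And Dagenham', 'Barnet']
```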

In [73]:
housing_london['area']= housing_london['area'].str.title()
display (housing_london.head(5))


Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag,year
0,E09000001,City Of London,1999-12-01,33020.0,,48922,0,6581.0,,,,1,1999
1,E09000002,Barking And Dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1,1999
2,E09000003,Barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,1,1999
3,E09000004,Bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,1,1999
4,E09000005,Brent,1999-12-01,18532.0,,20911,6,260317.0,,,,1,1999


### Exercise 4 - Filter all areas to find all with 'and' in the name
---

Use str.contains() in a filter (e.g. df[df['area'].str.contains('text')]) to find all areas with 'and' in the name
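One thing to watch for, sketched here on a hand-made dataframe: the letters 'and' also appear inside other words (such as 'england'), so searching for ' and ' with a space on each side is one way to match the word only. The small dataframe is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"area": ["barking and dagenham", "barnet", "england", "england and wales"]}
)

# Searching for the bare letters also matches 'england'
loose = df[df["area"].str.contains("and", regex=False)]

# Padding with spaces matches 'and' only as a separate word
with_and = df[df["area"].str.contains(" and ", regex=False)]

print(loose["area"].tolist())     # ['barking and dagenham', 'england', 'england and wales']
print(with_and["area"].tolist())  # ['barking and dagenham', 'england and wales']
```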

**Test output**:  
105 rows × 13 columns


In [80]:
filter_and = housing_london[housing_london['area'].str.contains(' and ')]
display (filter_and)

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag,year
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1,1999
12,E09000013,hammersmith and fulham,1999-12-01,25000.0,,28555,7,160634.0,,,,1,1999
19,E09000020,kensington and chelsea,1999-12-01,20646.0,,28074,13,147678.0,,,,1,1999
35,E12000003,yorkshire and the humber,1999-12-01,16527.0,,18977,7,4956325.0,,,,0,1999
47,K04000001,england and wales,1999-12-01,17974.0,,21549,,51933471.0,,,,0,1999
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1021,E09000002,barking and dagenham,2019-12-01,28738.0,,32010,,,,,,1,2019
1032,E09000013,hammersmith and fulham,2019-12-01,37990.0,,48362,,,,,,1,2019
1039,E09000020,kensington and chelsea,2019-12-01,33000.0,,41741,,,,,,1,2019
1055,E12000003,yorkshire and the humber,2019-12-01,27835.0,,32653,,,,,,0,2019


### Exercise 4 (continued) - Filter for areas starting with 'ba'
---

Filter the data for all areas starting with 'ba'  

**Test output**:  
21 rows for each matching area (barking and dagenham, barnet)

In [81]:
filter_ba = housing_london[housing_london['area'].str.startswith('ba')]
display (filter_ba)

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag,year
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3.0,162444.0,,,,1,1999
2,E09000003,barnet,1999-12-01,19568.0,,23128,8.0,313469.0,,,,1,1999
52,E09000002,barking and dagenham,2000-12-01,22618.0,,24696,4.0,163893.0,57000.0,,,1,2000
53,E09000003,barnet,2000-12-01,21761.0,,25755,8.0,315784.0,138000.0,,,1,2000
103,E09000002,barking and dagenham,2001-12-01,22323.0,,26050,3.0,165654.0,54000.0,3780.0,68298.0,1,2001
104,E09000003,barnet,2001-12-01,20916.0,,26068,8.0,319481.0,138000.0,8675.0,130515.0,1,2001
154,E09000002,barking and dagenham,2002-12-01,24813.0,,26653,3.0,166357.0,52000.0,3780.0,68526.0,1,2002
155,E09000003,barnet,2002-12-01,23112.0,,30210,13.0,320552.0,135000.0,8675.0,130801.0,1,2002
205,E09000002,barking and dagenham,2003-12-01,25358.0,,27792,5.0,166210.0,55000.0,3780.0,68837.0,1,2003
206,E09000003,barnet,2003-12-01,23828.0,,30518,16.0,321802.0,138000.0,8675.0,131883.0,1,2003


### Exercise 5
---
Filter the data for all areas ending with 'ham', for the year 2000
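A sketch of combining a string condition with a second condition, using a made-up dataframe that already has a numeric year column. Each condition produces a True/False Series, and `&` combines them row by row.

```python
import pandas as pd

# Hypothetical rows; 'year' is assumed to be numeric already
df = pd.DataFrame(
    {
        "area": ["newham", "newham", "barnet", "lewisham"],
        "year": [1999, 2000, 2000, 2000],
    }
)

# Wrap the comparison in brackets so that & is applied to the two complete conditions
ends_ham_2000 = df[df["area"].str.endswith("ham") & (df["year"] == 2000)]

print(ends_ham_2000["area"].tolist())  # ['newham', 'lewisham']
```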

**Test output**:  
4 rows (barking and dagenham, hammersmith and fulham, lewisham, newham)  

In [82]:
df = housing_london[(housing_london["area"].str.endswith("ham")) & (housing_london["date"].str.contains("2000"))]
display(df)


Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag,year
52,E09000002,barking and dagenham,2000-12-01,22618.0,,24696,4,163893.0,57000.0,,,1,2000
63,E09000013,hammersmith and fulham,2000-12-01,25264.0,,28742,8,164393.0,120000.0,,,1,2000
73,E09000023,lewisham,2000-12-01,22357.0,,22659,5,252106.0,76000.0,,,1,2000
75,E09000025,newham,2000-12-01,19437.0,,21609,3,245463.0,79000.0,,,1,2000


### Exercise 6 - new data set
---

Use the data set here:  https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true

Read the data from the sheet 'Skill Migration'  

Inspect the data, then create a new dataframe with the following changes:

1.  Remove the word 'Skills' from the 'skill_group_category' column  
2.  Convert country_code to uppercase  
3.  Filter for regions containing 'Asia'
4.  Remove the skill_group_id and the wb_income columns

**Test output**:  
9969 rows × 10 columns

In [120]:
# read data
url="https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true"
migration_data = pd.read_excel(url, sheet_name='Skill Migration')
# drop any 'Unnamed' columns: filter(regex=...) selects the columns whose name contains 'Unnamed'
migration_data = migration_data.drop(columns=migration_data.filter(regex="Unnamed").columns)

# 1. Remove the word 'Skills' from the 'skill_group_category' column, then strip the space left behind
migration_data['skill_group_category'] = migration_data['skill_group_category'].str.replace('Skills', '', regex=False).str.strip()
# 2. Convert country_code to uppercase
migration_data['country_code'] = migration_data['country_code'].str.upper()
# 3. Filter for regions containing 'Asia'
migration_data = migration_data[migration_data['wb_region'].str.contains("Asia")]
# 4. Remove the skill_group_id and the wb_income columns
migration_data = migration_data.drop(columns=['skill_group_id', 'wb_income'])

display(migration_data)

Unnamed: 0,country_code,country_name,wb_region,skill_group_category,skill_group_name,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,AF,Afghanistan,South Asia,Tech,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,AF,Afghanistan,South Asia,Business,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,AF,Afghanistan,South Asia,Specialized Industry,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,AF,Afghanistan,South Asia,Tech,Software Testing,-957.50,-828.54,-964.73,-406.50,-739.51
4,AF,Afghanistan,South Asia,Specialized Industry,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64
...,...,...,...,...,...,...,...,...,...,...
17371,VN,Vietnam,East Asia & Pacific,Specialized Industry,Air Force,658.89,291.89,-271.00,219.20,166.99
17372,VN,Vietnam,East Asia & Pacific,Specialized Industry,Cooking,83.81,236.02,-42.85,238.90,180.90
17373,VN,Vietnam,East Asia & Pacific,Business,Sales Leads,511.16,98.13,-355.91,133.89,181.47
17374,VN,Vietnam,East Asia & Pacific,Disruptive Tech,Aerospace Engineering,410.02,42.71,-222.09,138.03,256.55


### Exercise 7 - a data cleaning function

Create a function called **clean_data(df)** that performs the same actions as in Exercise 6 and *returns* the cleaned data set

Format:

```
def clean_data(df):
  # code to clean the data as in exercise 6, then return the final dataframe

cleaned_data = clean_data(migration_data)
```
**Test output:**  
9969 rows × 10 columns

In [117]:
def clean_data(df):
    # work on a copy so the dataframe passed in is left unchanged
    df = df.copy()
    # drop any 'Unnamed' columns (filter(regex=...) selects columns whose name contains 'Unnamed')
    df = df.drop(columns=df.filter(regex="Unnamed").columns)
    # 1. Remove the word 'Skills' from the 'skill_group_category' column, then strip the space left behind
    df['skill_group_category'] = df['skill_group_category'].str.replace('Skills', '', regex=False).str.strip()
    # 2. Convert country_code to uppercase
    df['country_code'] = df['country_code'].str.upper()
    # 3. Filter for regions containing 'Asia'
    df = df[df['wb_region'].str.contains("Asia")]
    # 4. Remove the skill_group_id and the wb_income columns
    df = df.drop(columns=['skill_group_id', 'wb_income'])
    # return the cleaned dataframe so the caller can store it
    return df

url="https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true"
migration_data = pd.read_excel(url, sheet_name='Skill Migration')
cleaned_data = clean_data(migration_data)
display(cleaned_data)

Unnamed: 0,country_code,country_name,wb_region,skill_group_category,skill_group_name,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,AF,Afghanistan,South Asia,Tech,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,AF,Afghanistan,South Asia,Business,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,AF,Afghanistan,South Asia,Specialized Industry,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,AF,Afghanistan,South Asia,Tech,Software Testing,-957.50,-828.54,-964.73,-406.50,-739.51
4,AF,Afghanistan,South Asia,Specialized Industry,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64
...,...,...,...,...,...,...,...,...,...,...
17371,VN,Vietnam,East Asia & Pacific,Specialized Industry,Air Force,658.89,291.89,-271.00,219.20,166.99
17372,VN,Vietnam,East Asia & Pacific,Specialized Industry,Cooking,83.81,236.02,-42.85,238.90,180.90
17373,VN,Vietnam,East Asia & Pacific,Business,Sales Leads,511.16,98.13,-355.91,133.89,181.47
17374,VN,Vietnam,East Asia & Pacific,Disruptive Tech,Aerospace Engineering,410.02,42.71,-222.09,138.03,256.55


### Exercise 8
---

Write a function that will rename the net_per_10K_year columns to be just the year and will replace the 'z' (as in 'specialized') with 's' to Anglicise the spelling. The function should return the cleaned data.  

Hint:  for column names, you can get the name as a string, then use find() to see if 'net_per_10K_' is in the string (the result will not be -1 if it is there), then you can replace the column name
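The approach in the hint could look something like the sketch below. The one-row dataframe is made up for illustration (the column names follow the dataset), and `new_names` is just a helper dictionary introduced here.

```python
import pandas as pd

# Hypothetical one-row dataframe with the same style of column names
df = pd.DataFrame(
    {"country_code": ["af"], "net_per_10K_2015": [-791.59], "net_per_10K_2016": [-705.88]}
)

prefix = "net_per_10K_"
new_names = {}
for name in df.columns:
    # find() returns the position of the prefix, or -1 if it is not in the name
    if name.find(prefix) != -1:
        new_names[name] = name.replace(prefix, "")

# rename() returns a new dataframe with the mapped column names
renamed = df.rename(columns=new_names)

print(list(renamed.columns))  # ['country_code', '2015', '2016']
```

Looping over the names is the most explicit way to do this; applying `.str.replace()` to `df.columns` gets the same result in one line.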

**Test output**:  
17617 rows × 12 columns, with z replaced by s in Specialized  
Column names: country_code	country_name	wb_income	wb_region	skill_group_id	skill_group_category	skill_group_name	2015	2016	2017	2018	2019

In [148]:
import pandas as pd

def rename(df):
    # work on a copy so the dataframe passed in is left unchanged
    df = df.copy()
    # drop any 'Unnamed' columns (filter(regex=...) selects columns whose name contains 'Unnamed')
    df = df.drop(columns=df.filter(regex="Unnamed").columns)
    # strip the 'net_per_10K_' prefix from the column names, leaving just the year
    df.columns = df.columns.str.replace("net_per_10K_", "", regex=False)
    # Anglicise the spelling wherever 'Specialized' appears in the data
    df = df.replace("Specialized", "Specialised", regex=True)
    # return the cleaned dataframe
    return df


url="https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true"
migration_data = pd.read_excel(url, sheet_name='Skill Migration')
migration_data = rename(migration_data)
migration_data.head()

Unnamed: 0,country_code,country_name,wb_income,wb_region,skill_group_id,skill_group_category,skill_group_name,2015,2016,2017,2018,2019
0,af,Afghanistan,Low income,South Asia,2549,Tech Skills,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,af,Afghanistan,Low income,South Asia,2608,Business Skills,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,af,Afghanistan,Low income,South Asia,3806,Specialised Industry Skills,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,af,Afghanistan,Low income,South Asia,50321,Tech Skills,Software Testing,-957.5,-828.54,-964.73,-406.5,-739.51
4,af,Afghanistan,Low income,South Asia,1606,Specialised Industry Skills,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64


### Exercise 9
---

Read the 'Country Migration' sheet.

Write a function that will:  
*  convert the country codes to upper case  
*  drop the lat and long columns for both base and target  
*  rename the net_per_10K_year columns to year only  
*  filter for rows where base_country_wb_region contains 'Africa' and target_country_wb_region contains 'Asia'  
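The last step needs two conditions on different columns. As a sketch on a made-up three-row dataframe: build each condition as its own True/False Series, then combine them with `&`. Python's `and` does not work here, because it tries to reduce each whole Series to a single True/False value and pandas raises a ValueError.

```python
import pandas as pd

# Hypothetical routes; the column names follow the dataset
routes = pd.DataFrame(
    {
        "base_country_wb_region": [
            "Sub-Saharan Africa",
            "South Asia",
            "Middle East & North Africa",
        ],
        "target_country_wb_region": [
            "East Asia & Pacific",
            "Sub-Saharan Africa",
            "Latin America & Caribbean",
        ],
    }
)

from_africa = routes["base_country_wb_region"].str.contains("Africa")
to_asia = routes["target_country_wb_region"].str.contains("Asia")

# & compares the two Series row by row
africa_to_asia = routes[from_africa & to_asia]

print(africa_to_asia.index.tolist())  # [0]
```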

**Test output**:  
```
base_country_code	base_country_name	base_country_wb_income	base_country_wb_region	target_country_code	target_country_name	target_country_wb_income	target_country_wb_region	2015	2016	2017	2018	2019
0	AE	United Arab Emirates	High Income	Middle East & North Africa	AF	Afghanistan	Low Income	South Asia	0.19	0.16	0.11	-0.05	-0.02
4	AE	United Arab Emirates	High Income	Middle East & North Africa	AM	Armenia	Upper Middle Income	Europe & Central Asia	0.10	0.05	0.03	-0.01	0.02
5	AE	United Arab Emirates	High Income	Middle East & North Africa	AU	Australia	High Income	East Asia & Pacific	-1.06	-3.31	-4.01	-4.58	-4.09
6	AE	United Arab Emirates	High Income	Middle East & North Africa	AT	Austria	High Income	Europe & Central Asia	0.11	-0.08	-0.07	-0.05	-0.16
7	AE	United Arab Emirates	High Income	Middle East & North Africa	AZ	Azerbaijan	Upper Middle Income	Europe & Central Asia	0.24	0.25	0.10	0.05	0.04
...	...	...	...	...	...	...	...	...	...	...	...	...	...
4132	ZM	Zambia	Lower Middle Income	Sub-Saharan Africa	GB	United Kingdom	High Income	Europe & Central Asia	43.27	27.60	7.88	6.90	3.68
4135	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	AU	Australia	High Income	East Asia & Pacific	-1.31	-2.33	-2.10	-2.08	-1.84
4138	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	IS	Iceland	High Income	Europe & Central Asia	8.52	6.22	2.35	1.81	0.97
4142	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	NO	Norway	High Income	Europe & Central Asia	2.88	6.46	2.10	0.33	-0.13
4145	ZW	Zimbabwe	Low Income	Sub-Saharan Africa	GB	United Kingdom	High Income	Europe & Central Asia	3.91	4.66	0.74	-0.66	-1.97
478 rows × 13 columns
```



In [130]:
url="https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true"
country_migration = pd.read_excel(url, sheet_name='Country Migration')

def clean_country_migration(df):
    # work on a copy so the dataframe passed in is left unchanged
    df = df.copy()
    # convert the country codes to upper case
    for col in ['base_country_code', 'target_country_code']:
        df[col] = df[col].str.upper()
    # drop the lat and long columns for both base and target
    # (assumes these are the columns whose names end in _lat or _long)
    df = df.drop(columns=df.filter(regex='_lat$|_long$').columns)
    # rename the net_per_10K_year columns to year only
    df.columns = df.columns.str.replace('net_per_10K_', '', regex=False)
    # filter: combine the two conditions with & (not 'and') so they are compared row by row
    from_africa = df['base_country_wb_region'].str.contains('Africa', na=False)
    to_asia = df['target_country_wb_region'].str.contains('Asia', na=False)
    return df[from_africa & to_asia]

display(clean_country_migration(country_migration))

### Exercise 10
---

Read the data from file 'https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv'.

Write a function that will return a new dataframe with just the married women listed, surname only.
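A sketch of the two string steps on three names copied from the dataset. It assumes that 'Mrs' in the name marks a married woman and that every name is written as 'Surname, Title. Forenames'.

```python
import pandas as pd

names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Heikkinen, Miss. Laina",
    ]
)

# Step 1: keep only the names containing the title 'Mrs'
is_mrs = names.str.contains("Mrs", regex=False)

# Step 2: the surname is the part before the comma
surnames = names[is_mrs].str.split(",").str.get(0)

print(surnames.tolist())  # ['Futrelle']
```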

**Test output**:  
```
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings	female	38.0	1	0	PC 17599	71.2833	C85	C
3	4	1	1	Futrelle	female	35.0	1	0	113803	53.1000	C123	S
8	9	1	3	Johnson	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser	female	14.0	1	0	237736	30.0708	NaN	C
15	16	1	2	Hewlett	female	55.0	0	0	248706	16.0000	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
871	872	1	1	Beckwith	female	47.0	1	1	11751	52.5542	D35	S
874	875	1	2	Abelson	female	28.0	1	0	P/PP 3381	24.0000	NaN	C
879	880	1	1	Potter	female	56.0	0	1	11767	83.1583	C50	C
880	881	1	2	Shelley	female	25.0	0	1	230433	26.0000	NaN	S
885	886	0	3	Rice	female	39.0	0	5	382652	29.1250	NaN	Q
129 rows × 12 columns
```





In [196]:
import pandas as pd

def married_surnames(df):
    """Return only the married women, with Name reduced to the surname.

    Assumes married women all have 'Mrs' in Name.
    """
    # keep the rows whose Name contains 'Mrs'; copy so the original dataframe is not modified
    married = df[df['Name'].str.contains("Mrs", regex=False)].copy()
    # names are written 'Surname, Title. Forenames', so the surname is the part before the comma
    married['Name'] = married['Name'].str.split(',').str.get(0)
    return married


url="https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
dataset = pd.read_csv(url)

married = married_surnames(dataset)
# the function returns a new dataframe; the original dataset is left unchanged
dataset.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
