# String Transformations

## Outline

1. Basic string operations
    * Managing case and whitespace
2. Standard string operations in `pyspark`
    * REPLACE
    * SPLIT and GET
    * EXTRACT
    * RECODE

In [58]:
from pyspark.sql import SparkSession
from more_pyspark import get_spark_types, to_pandas

spark = SparkSession.builder.appName('Ops').getOrCreate()

# Working with Strings

## Data set

We will be using two of the data sets provided by the Museum of Modern Art (MoMA) in this lecture.  Make sure that you have downloaded each repository.  [Download Instructions](./get_MOMA_data.ipynb)

In [59]:
artists = spark.read.csv("./data/Artists.csv", inferSchema=True, header=True)
artwork = spark.read.csv("./data/Artworks.csv", inferSchema=True, header=True)

In [60]:
artists.take(5) >> to_pandas

Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN
0,1,Robert Arneson,"American, 1930–1992",American,Male,1930,1992,,
1,2,Doroteo Arnaiz,"Spanish, born 1936",Spanish,Male,1936,0,,
2,3,Bill Arnold,"American, born 1941",American,Male,1941,0,,
3,4,Charles Arnoldi,"American, born 1946",American,Male,1946,0,Q1063584,500027998.0
4,5,Per Arnoldi,"Danish, born 1941",Danish,Male,1941,0,,


In [61]:
artwork.take(5) >> to_pandas

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.3,,,31.8,,
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1980,Photographic reproduction with colored synthet...,...,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,,,,50.8,,,50.8,,
4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, color pencil, ink, and gouache on tr...",...,http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi...,,,,38.4,,,19.1,,


## Managing case and whitespace

The following table compares the functions/methods for managing case and whitespace.

| `python` method | `pandas` method | `pyspark` function |
| --- | --- | --- |
| `s.lower()` | `df.a.str.lower()` | `lower(df.a)` | 
| `s.upper()` | `df.a.str.upper()` | `upper(df.a)` | 
| `s.strip()` | `df.a.str.strip()` | `trim(df.a)` | 
| `s.lstrip()` | `df.a.str.lstrip()` | `ltrim(df.a)` | 
| `s.rstrip()` | `df.a.str.rstrip()` | `rtrim(df.a)` | 
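
As a quick check of what each row does, here is a minimal plain-Python sketch of the first column of the table. The sample string is made up for illustration; the `pyspark` functions in the third column apply the same idea to a whole column.

```python
# Plain-Python versions of the operations in the first column of the table.
raw = "  Otto Wagner  "

lowered = raw.lower()    # case changes leave the whitespace alone
uppered = raw.upper()
stripped = raw.strip()   # removes whitespace at both ends (compare trim)
left = raw.lstrip()      # removes leading whitespace only (compare ltrim)
right = raw.rstrip()     # removes trailing whitespace only (compare rtrim)

print(repr(lowered), repr(stripped), repr(left), repr(right))
```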

### Example - Lower-case Artists

In [62]:
from pyspark.sql.functions import lower
from more_pyspark import to_pandas

(artwork
 .select(artwork.Artist)
 .withColumn('Artist', lower(artwork.Artist))
 .take(2)) >> to_pandas

Unnamed: 0,Artist
0,otto wagner
1,christian de portzamparc


## REPLACE

Another important string operation involves replacing one substring with another.  In this section, we will illustrate using a regular expression to accomplish this task.

### Example - The BeginDate mess

In [63]:
(artwork
.select('BeginDate')
.take(5)
) >> to_pandas

Unnamed: 0,BeginDate
0,(1841)
1,(1944)
2,(1876)
3,(1944)
4,(1876)


### Using a regular expression in `pyspark`

`pyspark` provides `regexp_replace(col, pattern, replacement)`, which replaces every substring that matches a regular expression.

In [64]:
from pyspark.sql.functions import regexp_replace

(artwork
 .select(['BeginDate'])
 .withColumn('BeginDate', regexp_replace(artwork.BeginDate, r'[()]', ''))
 .take(3)) >> to_pandas

Unnamed: 0,BeginDate
0,1841
1,1944
2,1876
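
The pattern can also be tried out locally before running it on the full table. The sketch below uses Python's `re.sub` as a stand-in, on values copied from the output above. Python and Spark use different regex engines, but a simple character class like `[()]` should behave the same way in both.

```python
import re

begin_dates = ["(1841)", "(1944)", "(1876)"]

# [()] is a character class that matches either parenthesis;
# replacing each match with '' deletes it.
cleaned = [re.sub(r"[()]", "", d) for d in begin_dates]

print(cleaned)
```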


## SPLIT & GET

The third basic string operation consists of splitting strings and extracting the resulting parts. In this section, we will first highlight the `split` function, then use `getItem` to pull individual elements out of the resulting array.

### Splitting strings

In `pyspark`, we can split using `split` from `pyspark.sql.functions`.

* Syntax: `split(df.c, pattern)`
* Accepts regular expressions

### Example 1 - Splitting the Artist's Name

In [65]:
from pyspark.sql.functions import split

(artwork
 .select(artwork.Artist)
 .withColumn('names', split(artwork.Artist, ' '))
 .take(2)) >> to_pandas

Unnamed: 0,Artist,names
0,Otto Wagner,"[Otto, Wagner]"
1,Christian de Portzamparc,"[Christian, de, Portzamparc]"


### Using `getItem` for a `pyspark` array

**Note:** `getItem` doesn't allow negative indexing

In [66]:
from pyspark.sql.functions import last, size, col

(artwork
 .select(artwork.Artist)
 .withColumn('first', split(artwork.Artist, ' ').getItem(0))
 .withColumn('last', split(artwork.Artist, ' ').getItem(1)) # Oops! Index 1 is not the last name when there are more than two parts
 .take(2)) >> to_pandas

Unnamed: 0,Artist,first,last
0,Otto Wagner,Otto,Wagner
1,Christian de Portzamparc,Christian,de
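
The last name is usually the final element of the split, which `getItem(1)` misses for names with more than two parts. The plain-Python sketch below shows the intended result using a negative index. In Spark itself, one option is `element_at` from `pyspark.sql.functions` (Spark 2.4+), which accepts `-1` for the last element of an array. The variable names here are illustrative only.

```python
names = ["Otto Wagner", "Christian de Portzamparc"]

first_last = []
for name in names:
    parts = name.split(" ")
    # parts[0] is the first token; parts[-1] is the last token,
    # however many tokens the name has.
    first_last.append((parts[0], parts[-1]))

print(first_last)
```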


### Example 2 - Splitting the Artist's Bio using a regular expression

In [67]:
(artwork
 .select(artwork.ArtistBio)
 .withColumn('ArtistBio', regexp_replace('ArtistBio', '[()]', ''))
 .withColumn('ArtistiBioNew', split('ArtistBio', r', born |, |[-–]')) # Two types of dash: "-" and "–" :(
 .take(5)
) >> to_pandas

Unnamed: 0,ArtistBio,ArtistiBioNew
0,"Austrian, 1841–1918","[Austrian, 1841, 1918]"
1,"French, born 1944","[French, 1944]"
2,"Austrian, 1876–1957","[Austrian, 1876, 1957]"
3,"French and Swiss, born Switzerland 1944","[French and Swiss, Switzerland 1944]"
4,"Austrian, 1876–1957","[Austrian, 1876, 1957]"
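
Because the array display does not quote its elements, it can be hard to see where one element ends and the next begins. A plain-Python sketch with `re.split` makes the boundaries explicit. The pattern used here is an assumed variant that tries `, born ` before the bare `, ` separator, and Python's regex engine stands in for Spark's.

```python
import re

bios = [
    "Austrian, 1841–1918",
    "French, born 1944",
    "French and Swiss, born Switzerland 1944",
]

# Alternatives are tried left to right, so ', born ' must come before ', '.
# The character class [-–] covers both the hyphen and the en dash.
pattern = r", born |, |[-–]"

parts = [re.split(pattern, bio) for bio in bios]

for bio, p in zip(bios, parts):
    print(repr(bio), "->", p)
```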


## EXTRACT

### Extracting by position

In `pyspark`, use `substring(str, start, len)` to extract a substring by position.

**`start` is 1-based, not 0-based!**


In [68]:
from pyspark.sql.functions import substring
(artwork
 .select(col('BeginDate'))
 .withColumn('BeginDate', regexp_replace(col('BeginDate'), '[()]', ''))
 .withColumn('century', substring(col('BeginDate'),1, 2))
 .withColumn('year_in_century', substring(col('BeginDate'),3, 2))
 .take(5)) >> to_pandas

Unnamed: 0,BeginDate,century,year_in_century
0,1841,18,41
1,1944,19,44
2,1876,18,76
3,1944,19,44
4,1876,18,76
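
To make the 1-based convention concrete, the sketch below compares it with Python's 0-based slicing. `substring_1_based` is a hypothetical helper written for this illustration (it only handles `start >= 1`); it is not part of `pyspark`.

```python
def substring_1_based(s, start, length):
    """Hypothetical helper: 1-based start plus a length, for start >= 1."""
    begin = start - 1              # shift to Python's 0-based index
    return s[begin:begin + length]

year = "1841"

century = substring_1_based(year, 1, 2)          # characters 1-2
year_in_century = substring_1_based(year, 3, 2)  # characters 3-4

# The equivalent 0-based Python slices
assert century == year[0:2]
assert year_in_century == year[2:4]

print(century, year_in_century)
```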


### Extracting with a RegEx

In `pyspark`, use `regexp_extract(str, pattern, group)`

* `str` is a column
* `pattern` is the RegEx pattern with one or more capture groups
* `group` is the index of the group to be extracted
    * **Not zero based!** Capture groups are numbered from 1; group 0 is the entire match.


In [69]:
from pyspark.sql.functions import regexp_extract

(artwork
 .select(col('ArtistBio'))
 .withColumn('country_of_birth', regexp_extract(col('ArtistBio'), r', born ([a-zA-Z]+)', 1))
 .withColumn('year_of_birth', regexp_extract(col('ArtistBio'), r'(\d{4})', 1))
 .withColumn('year_of_death', regexp_extract(col('ArtistBio'), r', (\d{4})–(\d{4})', 2)) #Extract group 2
 .take(5)) >> to_pandas

Unnamed: 0,ArtistBio,country_of_birth,year_of_birth,year_of_death
0,"(Austrian, 1841–1918)",,1841,1918.0
1,"(French, born 1944)",,1944,
2,"(Austrian, 1876–1957)",,1876,1957.0
3,"(French and Swiss, born Switzerland 1944)",Switzerland,1944,
4,"(Austrian, 1876–1957)",,1876,1957.0
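
Group numbering can be explored in plain Python with `re.search`. In the sketch below, `extract` is a hypothetical helper that returns an empty string when nothing matches, mirroring the empty cells in the output above. It is an illustration, not Spark's implementation.

```python
import re

def extract(s, pattern, group):
    """Hypothetical helper: the requested group, or '' if there is no match."""
    m = re.search(pattern, s)
    if m is None:
        return ""
    return m.group(group) or ""

pattern = r", (\d{4})–(\d{4})"
bio = "(Austrian, 1841–1918)"

whole = extract(bio, pattern, 0)   # group 0: the entire match
birth = extract(bio, pattern, 1)   # group 1: first set of parentheses
death = extract(bio, pattern, 2)   # group 2: second set of parentheses
no_match = extract("(French, born 1944)", pattern, 2)

print(repr(whole), birth, death, repr(no_match))
```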


## RECODE

### Recoding with a `dict` in `pyspark`

* Use `more_pyspark.recode(col, d, default=None)`
* `d` is the translation `dict`
* Use `default` keyword to add a default value

In [70]:
from more_pyspark import recode
from pyspark.sql.functions import col

new_sex = {'Male':'m', 'Female':'f'}

(artists
 .withColumn("Sex", recode('Gender', new_sex))
 .where(col('ConstituentID').isin([16,18]))
 .collect()) >> to_pandas



Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN,Sex
0,16,Cristobal Arteche,"Spanish, 1900–1964",Spanish,Male,1900,1964,,,m
1,18,Artko,,,,0,0,,,


### Providing a default value

In [71]:
(artists
 .withColumn("Sex", recode('Gender', new_sex, default='Unknown'))
 .where(col('ConstituentID').isin([16,18]))
 .collect()) >> to_pandas



Unnamed: 0,ConstituentID,DisplayName,ArtistBio,Nationality,Gender,BeginDate,EndDate,Wiki QID,ULAN,Sex
0,16,Cristobal Arteche,"Spanish, 1900–1964",Spanish,Male,1900,1964,,,m
1,18,Artko,,,,0,0,,,Unknown
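
Conceptually, recoding is a dictionary lookup with a fallback. The sketch below shows that idea on single values using `dict.get`. `recode_value` is a hypothetical stand-in written for this illustration and is not how `more_pyspark.recode` is implemented.

```python
new_sex = {'Male': 'm', 'Female': 'f'}

def recode_value(value, mapping, default=None):
    """Hypothetical stand-in: look the value up, fall back to the default."""
    return mapping.get(value, default)

genders = ['Male', 'Female', None]

# Without a default, unmatched values stay missing
no_default = [recode_value(g, new_sex) for g in genders]

# With a default, unmatched values get the fallback label
with_default = [recode_value(g, new_sex, default='Unknown') for g in genders]

print(no_default)
print(with_default)
```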


## <font color="red"> Exercise 6.4.1</font>

Let's return to the `ArtistBio` column and practice all of our string processing functions by cleaning up this column.

1. Use the REPLACE pattern to remove extra parentheses
2. Create a `Nationality` column by using the SPLIT and GET pattern to grab everything before the comma.
3. Use the EXTRACT pattern to extract the year of birth and death.  Note that you will also need to use the IF ELSE or CASE WHEN patterns to account for the artists that are still alive.
4. Use the RECODE pattern to create a column called `American` that contains `'Yes'` if the artist is American and `'No'` otherwise.

In [74]:
from pyspark.sql.functions import when
american = {'American':'Yes'}
(artwork
 .select(artwork.ArtistBio)
 .withColumn('ArtistBio', regexp_replace('ArtistBio', '[()]', ''))
 .withColumn('Nationality', split(col("ArtistBio"),",").getItem(0))
 .withColumn('year_of_birth', regexp_extract(col('ArtistBio'), r'(\d{4})', 1))
 .withColumn('year_of_death',
             when(regexp_extract(col('ArtistBio'), r', (\d{4})–(\d{4})', 2) == "", "Alive")
             .otherwise(regexp_extract(col('ArtistBio'), r', (\d{4})–(\d{4})', 2)))
 .withColumn('American', recode('Nationality', american, default='No'))
 .collect()
) >> to_pandas

                                                                                

Unnamed: 0,ArtistBio,Nationality,year_of_birth,year_of_death,American
0,"Austrian, 1841–1918",Austrian,1841,1918,No
1,"French, born 1944",French,1944,Alive,No
2,"Austrian, 1876–1957",Austrian,1876,1957,No
3,"French and Swiss, born Switzerland 1944",French and Swiss,1944,Alive,No
4,"Austrian, 1876–1957",Austrian,1876,1957,No
...,...,...,...,...,...
150401,"American, 1861–1934 American, 1860–1933",American,1861,1934,Yes
150402,"Swiss, 1889–1943",Swiss,1889,1943,No
150403,"Swiss, 1889–1943",Swiss,1889,1943,No
150404,"Swiss, 1889–1943",Swiss,1889,1943,No


In [None]:
# Your code here