# Basic Concepts of String Manipulation

Start your journey into the world of regular expressions! This notebook covers slicing and concatenating, adjusting case, removing whitespace, and finding and replacing strings. You will practice basic string manipulation operations using a movie review dataset.

In [7]:
movie = 'oh my God! desserts I stressed was an ugly movie'
movie[::3]

'omG!ees rs sng v'

In [2]:
# Get the word
movie_title = movie[11:30]

# Reverse the word to get the palindrome candidate
palindrome = movie_title[::-1]

# Print the word if it's a palindrome
if movie_title == palindrome:
    print(movie_title)

desserts I stressed
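
The check above can be wrapped in a small helper. This is only a sketch: `is_palindrome` is a hypothetical name that does not appear in the original exercise, and it simply compares a string with its reverse obtained through the `[::-1]` slice.

```python
def is_palindrome(text):
    """Return True when text reads the same forwards and backwards."""
    return text == text[::-1]

movie = 'oh my God! desserts I stressed was an ugly movie'
candidate = movie[11:30]          # slice end is exclusive: characters 11..29

print(candidate, is_palindrome(candidate))
print(is_palindrome(movie))
```

Keeping the comparison in a function makes it easy to apply the same test to every review in a dataset.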


## Normalizing reviews
It's time to extract some important words from your movie review dataset. First you need to normalize the words, and then count their frequency. Normalization includes converting all words to lowercase, removing special characters, and extracting the root of each word so that its variants are counted as one.
Imagine you have the following two reviews: "The movie surprises me very much" and "Marvel movies always surprise their audience". If you count raw word frequency, you will count "surprises" once and "surprise" once. However, the verb "surprise" appears in both reviews, so its frequency should be two.

In [5]:
# Convert to lowercase and print the result
movie = '$I supposed that coming from MTV Films I should expect no less$'
movie_lower = movie.lower()
print(movie_lower)

$i supposed that coming from mtv films i should expect no less$


In [6]:
# Strip the surrounding "$" characters and print the result
movie_no_space = movie_lower.strip("$")  # .strip() removes the given characters from both ends
print(movie_no_space)

i supposed that coming from mtv films i should expect no less


In [7]:
movie_split = movie_no_space.split()
print(movie_split)

['i', 'supposed', 'that', 'coming', 'from', 'mtv', 'films', 'i', 'should', 'expect', 'no', 'less']


In [15]:
# Convert to lowercase and print the result
movie = '$I supposed that coming from MTV Films I should expect no less$'
movie_lower = movie.lower()
print(movie_lower)

# Strip the surrounding "$" characters and print the result
movie_no_space = movie_lower.strip("$")  # .strip() removes the given characters from both ends
print(movie_no_space)

movie_split = movie_no_space.split()
print(movie_split)

# Select root word and print the result
word_root = movie_split[1][:-1]
print(word_root)

$i supposed that coming from mtv films i should expect no less$
i supposed that coming from mtv films i should expect no less
['i', 'supposed', 'that', 'coming', 'from', 'mtv', 'films', 'i', 'should', 'expect', 'no', 'less']
suppose
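
To see why the root matters for counting, here is a minimal sketch using the two example reviews from the introduction of this section. The `naive_root` helper is a hypothetical, deliberately crude rule (drop one trailing "s"); it is an assumption made for illustration, not a real stemmer.

```python
from collections import Counter

reviews = [
    "The movie surprises me very much",
    "Marvel movies always surprise their audience",
]

def naive_root(word):
    # Hypothetical, deliberately crude rule: drop one trailing "s".
    # It also mangles words such as "always", so it is not a real stemmer.
    return word[:-1] if word.endswith("s") else word

counts = Counter(
    naive_root(word)
    for review in reviews
    for word in review.lower().split()
)

print(counts["surprise"])
print(counts["movie"])
```

With the variants merged, both "surprise" and "movie" are counted twice.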


## Time to join!
While normalizing your text, you noticed that one review had a particular structure. This review ends with the HTML tag `<\i>` and has commas scattered throughout the sentence. You decide to remove the tag from the end, then split the string at the commas and join it back together without them.

In [5]:
movie = r'the film,however,is all good<\i>'  # raw string keeps the backslash literal

# Remove tags happening at the end and print results
movie_tag = movie.rstrip(r"<\i>")  # .rstrip() removes any of the given characters from the right end
print(movie_tag)

the film,however,is all good
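
One caveat worth a quick sketch: `.rstrip()` treats its argument as a set of characters, not as a suffix. The review text below is a made-up example, and `str.removesuffix` assumes Python 3.9 or later.

```python
review = r"I love sushi<\i>"
tag = r"<\i>"

# rstrip treats its argument as a set of characters, so the final "i"
# of "sushi" is stripped along with the tag.
stripped = review.rstrip(tag)

# removesuffix (Python 3.9+) removes the exact trailing substring only.
exact = review.removesuffix(tag)

print(stripped)
print(exact)
```

The original review is unaffected because it ends in "d", which is not one of the characters in the tag.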


In [10]:
# Split the string using commas and print results
movie_no_comma = movie_tag.split(",")
print(movie_no_comma)

['the film', 'however', 'is all good']


In [11]:
# Join back together and print results
movie_join = ' '.join(movie_no_comma)
print(movie_join)

the film however is all good


In [72]:
file = 'mtv films election, a high school comedy, is a current example\nfrom there, director steven spielberg wastes no time, taking us into the water on a midnight swim'

# Split string at line boundaries
file_split = file.split('\n')

# Print file_split
print(file_split)

# Complete for-loop to split by commas
for substring in file_split:
    substring_split = substring.split(',')
    print(substring_split)

import re
algo = 'steven'
re.findall(algo,file)

['mtv films election, a high school comedy, is a current example', 'from there, director steven spielberg wastes no time, taking us into the water on a midnight swim']
['mtv films election', ' a high school comedy', ' is a current example']
['from there', ' director steven spielberg wastes no time', ' taking us into the water on a midnight swim']


['steven']
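
The fragments printed above keep the space that followed each comma. A possible variation, shown here only as a sketch, uses `.splitlines()` for the line boundaries and `.strip()` on every fragment.

```python
file = ('mtv films election, a high school comedy, is a current example\n'
        'from there, director steven spielberg wastes no time, '
        'taking us into the water on a midnight swim')

# splitlines() breaks at line boundaries; strip() drops the space after each comma
rows = [[part.strip() for part in line.split(',')] for line in file.splitlines()]

for row in rows:
    print(row)
```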

## Finding a substring
It's a new day at work and you need to continue cleaning your dataset for the movie prediction project. While exploring the dataset, you notice a strange pattern: there are some repeated, consecutive words occurring between the character at position 37 and the character at position 41. You decide to write a function to find out which movie reviews show this peculiarity, remembering that the ending position you specify is not inclusive. If you detect the word, you also want to change the string by replacing it with only one instance of the word using `.replace()`.

Complete the `if-else` statement following the instructions.

Find if a pattern occurs between the characters 1 and 4 (inclusive) of string using `string.find(pattern, 1, 5)`. If not found, `.find()` will return -1.

In [15]:
movies = ['I believe you I always said that the actor actor is amazing in every movie he has played', 'it s astonishing how frightening the actor actor actor norton looks with a shaved head and a swastika on his chest.']

for movie in movies:
    # Find if actor occurs between 37 and 41 inclusive
    if movie.find("actor", 37, 42) == -1:
        print("Word not found")
    # Count occurrences and replace two by one
    elif movie.count("actor") == 2:  
        print(movie.replace("actor actor", "actor"))
    else:
        # Replace three occurrences by one
        print(movie.replace("actor actor actor", "actor"))

I believe you I always said that the actor is amazing in every movie he has played
it s astonishing how frightening the actor norton looks with a shaved head and a swastika on his chest.
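
The exclusive end position is easy to get wrong, so here is a small sketch of how the bounds behave. The shortened review string is an assumption made to keep the example compact.

```python
review = 'I believe you I always said that the actor actor is amazing'

# The whole pattern must fit inside review[start:end]; end is exclusive.
inside = review.find("actor", 37, 42)      # window covers positions 37..41
too_short = review.find("actor", 37, 41)   # window covers positions 37..40

print(inside, too_short)

# .index() behaves like .find() but raises ValueError instead of returning -1
try:
    review.index("actor", 37, 41)
except ValueError:
    print("not found")
```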


In [18]:
for movie in movies:
  # Find the first occurrence of the standalone word (surrounded by spaces)
  print(movie.find(' he '))

74
-1


In [68]:
movies = "the rest of the story isn't important because all it does is serve as a mere backdrop for the two stars to share the screen ."

# Replace negations 
movies_no_negation = movies.replace("isn't", "is")

# Replace important
movies_antonym = movies_no_negation.replace("important", "insignificanteee")

# Print out
print(movies_antonym)

# Regex metacharacters, for reference:
# \d: digit
# \w: word character
# \W: non-word character
# \s: whitespace
import re
buscador = 'is'
re.findall(buscador, movies)

the rest of the story is insignificanteee because all it does is serve as a mere backdrop for the two stars to share the screen .


['is', 'is']

In [23]:
# Assign the substrings to the variables
wikipedia_article = 'In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals.'

first_pos = wikipedia_article[3:19].lower()
second_pos = wikipedia_article[21:44].lower()

# Define string with placeholders 
my_string = "The tool {} is used in {}"
print(my_string.format(first_pos, second_pos))

The tool computer science is used in artificial intelligence


In [24]:
my_string = "The tool {1} is used in {0}"
print(my_string.format(first_pos, second_pos))

The tool artificial intelligence is used in computer science


In [60]:
courses = ['artificial intelligence', 'neural networks']

# Create a dictionary
plan = {"field": courses[0],
        "tool": courses[1]}

# Complete the placeholders accessing elements of field and tool keys
my_message = "If you are interested in {data[field]}, you can take the course related to {data[tool]}"

# Use dictionary to replace placeholders
print(my_message.format(data=plan))

import re
regex = r'#\d\W'  # raw string so the backslashes reach the regex engine unchanged
re.findall(regex, my_message)

If you are interested in artificial intelligence, you can take the course related to neural networks


[]
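
Dictionary keys are written differently in `str.format` placeholders and in f-strings. The comparison below is a sketch with a shortened message of my own.

```python
plan = {"field": "artificial intelligence", "tool": "neural networks"}

# str.format: keys inside the placeholder are written without quotes
with_format = "Field: {data[field]}, tool: {data[tool]}".format(data=plan)

# f-string: the placeholder is a normal Python expression, so the key is quoted
with_fstring = f"Field: {plan['field']}, tool: {plan['tool']}"

print(with_format)
print(with_fstring)
```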

## What day is today?
It's lunch time and you are talking with some of your colleagues. They comment that they feel that every morning someone should send them a reminder of what day it is so they can check in the calendar what their assignments are for that day.

You want to help out and decide to write a small script that takes the date and time of the day so that every morning, a message is sent to your colleagues. You can use the module datetime along with named placeholders to achieve your goal.

The date should be expressed as `Month day, year`, e.g. `April 16, 2019` and the time as `hh:mm`, e.g. `16:30`.

You write down some specifiers to help you: `%d` (day), `%B` (month name), `%m` (month number), `%Y` (year), `%H` (hour) and `%M` (minutes).

In [24]:
# Import datetime 
from datetime import datetime

# Assign date to get_date
get_date = datetime.now()
get_date


datetime.datetime(2020, 10, 28, 17, 19, 23, 150281)

In [28]:
from datetime import datetime
tengo_fecha= datetime.now()
tengo_fecha

datetime.datetime(2020, 10, 28, 17, 24, 10, 838630)

In [26]:
# Add named placeholders with format specifiers
message = "Good morning. Today is {today:%B %d, %Y}. It's {today:%H:%M} ... time to work!"

# Format date
print(message.format(today=get_date))

Good morning. Today is October 28, 2020. It's 17:19 ... time to work!


In [48]:
mensaje = 'Hey, lazybones. Time to get up. Today is {today:%d %B, %Y} and it is {today:%H:%M}... come on!'
print(mensaje.format(today=tengo_fecha))

Hey, lazybones. Time to get up. Today is 28 October, 2020 and it is 17:24... come on!
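
Because `datetime.now()` changes on every run, a reproducible sketch can use a fixed timestamp instead. The date below is the example from the exercise text, and the month name produced by `%B` assumes the default (English) locale.

```python
from datetime import datetime

# Fixed, hypothetical timestamp so the result does not depend on when the code runs
today = datetime(2019, 4, 16, 16, 30)

date_text = today.strftime("%B %d, %Y")
time_text = f"{today:%H:%M}"      # format specifiers also work inside f-strings

print(f"Today is {date_text}. It's {time_text}.")
```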


In [25]:
# f-strings!
field1 = 'sexiest job'
field2 = 'data is produced daily'
field3 = 'Individuals'

fact1 = '21'
fact2 = 2500000000000000000
fact3 = 72.41415415151
fact4 = 1.09

f"Data science is considered '{field1}' in the {fact1}st century"

print(f"About {fact2:e} of {field2} in the world") 
print(f"{field3} create around {fact3:.2f}% of the data but only {fact4:.1f}% is analyzed")

About 2.500000e+18 of data is produced daily in the world
Individuals create around 72.41% of the data but only 1.1% is analyzed
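
A few more format specifiers can be tried on the same values. This is a sketch of standard format-spec options, not part of the original exercise.

```python
fact2 = 2500000000000000000
fact3 = 72.41415415151

scientific = f"{fact2:.1e}"     # one digit after the decimal point
grouped = f"{fact2:,}"          # comma as thousands separator
rounded = f"{fact3:.2f}"        # fixed-point, two decimals
padded = f"{fact3:>10.1f}"      # right-aligned in a field 10 characters wide

print(scientific, grouped, rounded, repr(padded))
```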



You can also use `string.Template` for simple substitutions.

In [50]:
# Templates
from string import Template

tools = ['Natural Language Toolkit', '20', 'month']

# Select variables
our_tool = tools[0]
our_fee = tools[1]
our_pay = tools[2]

# Create template
course = Template("We are offering a 3-month beginner course on $tool just for $fee ${pay}ly")

# Substitute identifiers with three variables
print(course.substitute(tool=our_tool, fee=our_fee, pay=our_pay))

We are offering a 3-month beginner course on Natural Language Toolkit just for 20 monthly
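
As a sketch of how `string.Template` behaves when a value is missing, the example below contrasts `substitute()` with `safe_substitute()`. The shortened template text and the `values` dictionary are assumptions made for illustration.

```python
from string import Template

course = Template("Course on $tool for $fee ${pay}ly")
values = {"tool": "Natural Language Toolkit", "pay": "month"}   # "fee" is missing

# substitute() raises KeyError when an identifier has no value
try:
    course.substitute(values)
except KeyError as missing:
    print("missing identifier:", missing)

# safe_substitute() leaves unknown identifiers in place
partial = course.safe_substitute(values)
print(partial)
```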

## Are they bots?
The company that you are working for asked you to perform a sentiment analysis using a dataset with tweets. First of all, you need to do some cleaning and extract some information.
While printing out some text, you realize that some tweets contain user mentions. Some of these mentions follow a very strange pattern. A few examples that you notice: `@robot3!`, `@robot5&` and `@robot7#`

To analyze if those users are bots, you will do a proof of concept with one tweet and extract them using the `.findall()` method.

You write down some helpful metacharacters to help you later:

    \d: digit
    \w: word character
    \W: non-word character
    \s: whitespace

In [63]:
# Import the re module
import re

sentiment_analysis = '@robot9! @robot4& I have a good feeling that the show isgoing to be amazing! @robot9$ @robot7%'

# Write the regex
regex = r"@robot\d\W" # what about @robot\d+\W

# Find all matches of regex
print(re.findall(regex, sentiment_analysis))

['@robot9!', '@robot4&', '@robot9$', '@robot7%']
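
The comment in the cell above asks what `\d+` would change. A quick sketch with a made-up tweet containing multi-digit bot names shows the difference.

```python
import re

tweet = "@robot9! @robot42& hello @robot123# bye"

single_digit = re.findall(r"@robot\d\W", tweet)
any_digits = re.findall(r"@robot\d+\W", tweet)

print(single_digit)
print(any_digits)
```

With `\d`, only mentions with exactly one digit before the special character match; `\d+` accepts one or more digits.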


In [46]:
sentiment_analysis = "Unfortunately one of those moments wasn't a giant squid monster. User_mentions:2, likes: 9, number of retweets: 7"

# Write a regex to obtain user mentions
print(re.findall(r"User_mentions:\d", sentiment_analysis))

['User_mentions:2']


In [47]:
# Write a regex to obtain number of likes
#Use \d to indicate digits and \s to indicate whitespace.
print(re.findall(r"likes:\s\d", sentiment_analysis))

['likes: 9']


In [48]:
# Write a regex to obtain number of retweets
print(re.findall(r"number\sof\sretweets:\s\d", sentiment_analysis))

['number of retweets: 7']
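
The three counters can also be collected in one pass. The sketch below uses an alternation of the three labels and capturing groups; the combined pattern and the `stats` dictionary are my own additions, not part of the exercise.

```python
import re

tweet = ("Unfortunately one of those moments wasn't a giant squid monster. "
         "User_mentions:2, likes: 9, number of retweets: 7")

# Group 1 captures the label, group 2 the digits; \s? allows an optional space
pattern = r"(User_mentions|likes|number of retweets):\s?(\d+)"

stats = {label: int(value) for label, value in re.findall(pattern, tweet)}
print(stats)
```

When a pattern contains several groups, `re.findall` returns a list of tuples, one element per group.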


## Match and split
Some of the tweets in your dataset were downloaded incorrectly. Instead of having spaces to separate words, they have strange characters. You decide to use regular expressions to handle this situation. You print some of these tweets to understand which pattern you need to match.

You notice that sentences are always separated by a special character, followed by a number, the word `break`, and then another special character, e.g. `&4break!`. Words are always separated by a special character, the word `new`, and one arbitrary word character, e.g. `#newH`.

In [49]:
sentiment_analysis = 'He#newHis%newTin love with$newPscrappy. #8break%He is&newYmissing him@newLalready'

# Write a regex to match pattern separating sentences
regex_sentence = r"\W\dbreak\W"

# Replace the regex_sentence with a space
sentiment_sub = re.sub(regex_sentence, " ", sentiment_analysis)

# Write a regex to match pattern separating words
regex_words = r"\Wnew\w"

# Replace the regex_words and print the result
sentiment_final = re.sub(regex_words, " ", sentiment_sub)
print(sentiment_final)

He is in love with scrappy.  He is missing him already
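
The output above contains a double space where the two separators met. One possible variation, sketched here, joins both patterns with `|` and then collapses repeated whitespace.

```python
import re

tweet = 'He#newHis%newTin love with$newPscrappy. #8break%He is&newYmissing him@newLalready'

# Either separator is replaced by a space in a single pass
separators = r"\W\dbreak\W|\Wnew\w"
spaced = re.sub(separators, " ", tweet)

# Collapse runs of whitespace left behind by adjacent replacements
clean = re.sub(r"\s+", " ", spaced)
print(clean)
```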


## Everything clean
Back to your Twitter sentiment analysis project! There are several types of strings that increase your sentiment analysis complexity. But these strings do not provide any useful sentiment. Among them, we can have links and user mentions.

In order to clean the tweets, you want to extract some examples first. You know that most of the time links start with `http` and do not contain any whitespace. User mentions start with `@` and can have letters and numbers only, e.g. `@johnsmith3`.

You write down some helpful quantifiers: `*` zero or more times, `+` once or more, `?` zero or once.

To match a pattern that starts with a given sequence and has no whitespace, use the sequence followed by `\S+`. To find all matches, use the method `.findall()`.

To match a pattern that starts with the `@` symbol and can contain letters and numbers, use `@` followed by `\w+` (note that `\w` also matches the underscore). To find all matches, use the method `.findall()`.

In [74]:
sentiment_analysis = ['Boredd. Colddd @blueKnight39 @hernancito Internet keeps stuffing up. Save me! https://www.tellyourstory.com',
                     'I had a horrible nightmare last night @anitaLopez98 @MyredHat31 which affected my sleep, now Im really tired',
                     'im lonely  keep me company @YourBestCompany! @foxRadio https://radio.foxnews.com 22 female, new york,@hernu']

for tweet in sentiment_analysis:
    # Write regex to match http links and print out result
    print(re.findall(r"http\S+", tweet))

    # Write regex to match user mentions and print out result
    print(re.findall(r"@\w+", tweet))
    print('\n')

['https://www.tellyourstory.com']
['@blueKnight39', '@hernancito']


[]
['@anitaLopez98', '@MyredHat31']


['https://radio.foxnews.com']
['@YourBestCompany', '@foxRadio', '@hernu']
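
Once the patterns extract the right examples, the same patterns can be used to remove them. `clean_tweet` below is a hypothetical helper written for this sketch, not a function from the exercise.

```python
import re

def clean_tweet(tweet):
    # Hypothetical helper: drop links and user mentions, then tidy the spacing
    without_noise = re.sub(r"http\S+|@\w+", "", tweet)
    return " ".join(without_noise.split())

tweet = ('Boredd. Colddd @blueKnight39 @hernancito Internet keeps stuffing up. '
         'Save me! https://www.tellyourstory.com')
cleaned = clean_tweet(tweet)
print(cleaned)
```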




## Write a regex matching the hashtag pattern
To match a letter or a number, use `\w` (it also matches the underscore). If you want this character to be repeated one or more times, use `+`. The hashtag symbol `#` matches itself.

In [76]:
sentiment_analysis = 'ITS NOT ENOUGH TO SAY THAT IMISS U #MissYou #SoMuch #Friendship #Forever'
regex = r"#\w+"

re.findall(regex, sentiment_analysis)

['#MissYou', '#SoMuch', '#Friendship', '#Forever']

To split a text at every pattern match, use `re.split()`. To split at one or more consecutive whitespace characters (`\s`), add the `+` quantifier.

In [53]:
# Replace the regex by an empty string
no_hashtag = re.sub(regex, "", sentiment_analysis)


print(re.split(r"\s+", no_hashtag))

['ITS', 'NOT', 'ENOUGH', 'TO', 'SAY', 'THAT', 'IMISS', 'U', '']
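
The empty string at the end of the list comes from the whitespace left behind by the removed hashtags. A short sketch of two ways to avoid it, using a hand-written copy of the hashtag-free text:

```python
import re

no_hashtag = 'ITS NOT ENOUGH TO SAY THAT IMISS U    '

raw_split = re.split(r"\s+", no_hashtag)            # trailing whitespace yields ''
trimmed_split = re.split(r"\s+", no_hashtag.strip())
plain_split = no_hashtag.split()                     # str.split() also skips empties

print(raw_split[-1] == '', trimmed_split == plain_split)
```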


## Write a regex to match a valid email address
To choose between different characters use `[]`. Use `a-z` for lowercase, `A-Z` for uppercase letters and `0-9` for numbers. Don't forget to escape `.` and `$` as they have another meaning. Use `\w` for any word character.

In [58]:
emails = ['n.john.smith@gmail.com', '87victory@hotmail.com', '!#mary-=@msca.net']

regex = r"[A-Za-z0-9!#%&*\$\.]+@\w+\.com"

for example in emails:
    # Match the regex to the string
    if re.match(regex, example):
        # Complete the format method to print out the result
        print("The email {email_example} is a valid email".format(email_example=example))
    else:
        print("The email {email_example} is invalid".format(email_example=example))   

The email n.john.smith@gmail.com is a valid email
The email 87victory@hotmail.com is a valid email
The email !#mary-=@msca.net is invalid
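
The advice about escaping `.` and `$` matters mostly outside character classes. The sketch below uses made-up strings to show where the backslash changes the result and where it does not.

```python
import re

# Outside a character class, "." matches any character unless it is escaped
unescaped = bool(re.fullmatch(r"gmail.com", "gmailXcom"))
escaped = bool(re.fullmatch(r"gmail\.com", "gmailXcom"))

# Inside [], "." and "$" are already literal, so the backslash is optional
same_class = re.findall(r"[.$]", "a.b$c") == re.findall(r"[\.\$]", "a.b$c")

print(unescaped, escaped, same_class)
```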


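In [70]:
# Variation: also allow "-" and "=" before the "@" and drop the ".com" requirement,
# so the third address is now accepted
regex = r"[A-Za-z0-9!#\-=%&*\$\.]+@\w+\."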

for example in emails:
    # Match the regex to the string
    if re.match(regex, example):
        # Complete the format method to print out the result
        print("The email {email_example} is a valid email".format(email_example=example))
    else:
        print("The email {email_example} is invalid".format(email_example=example))   

The email n.john.smith@gmail.com is a valid email
The email 87victory@hotmail.com is a valid email
The email !#mary-=@msca.net is a valid email


## Invalid password
The second part of the website project is to write a script that validates the password entered by the user. The company has set the following rules for valid passwords:

- It can contain lowercase `a-z` and uppercase letters `A-Z`
- It can contain numbers `0-9`
- It can contain the symbols: `*`, `#`, `$`, `%`, `!`, `&`, `.`
- It must be at least 8 characters long but not more than 20. Use `{}`

In [61]:
passwords = ['Apple34!rose', 'My87hou#4$', 'abc123']

# Write a regex to match a valid password
regex = r"[A-Za-z0-9!#%&*\$\.]{8,20}"

for example in passwords:
    # Require the whole string to match, so the length limits apply to the full password
    if re.fullmatch(regex, example):
        # Complete the format method to print out the result
        print("The password {pass_example} is a valid password".format(pass_example=example))
    else:
        print("The password {pass_example} is invalid".format(pass_example=example))  

The password Apple34!rose is a valid password
The password My87hou#4$ is a valid password
The password abc123 is invalid
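
The length and character rules only hold if the pattern has to cover the entire string. The following sketch compares `re.search` and `re.fullmatch` on the same pattern; the second candidate is a made-up password containing spaces.

```python
import re

regex = r"[A-Za-z0-9!#%&*\$\.]{8,20}"
candidates = ['Apple34!rose', 'has space Apple34!rose', 'abc123']

for candidate in candidates:
    found_somewhere = bool(re.search(regex, candidate))
    whole_string = bool(re.fullmatch(regex, candidate))
    print(candidate, found_somewhere, whole_string)

results = {c: bool(re.fullmatch(regex, c)) for c in candidates}
```

`re.search` succeeds as soon as any 8 to 20 valid characters appear in a row, while `re.fullmatch` rejects a string that contains anything else.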


## Groupings

To capture a group, place parentheses around it: `(group)regex`. To match any lowercase letter use `a-z`, any uppercase letter `A-Z`, and any digit `0-9`. Put these ranges inside `[]` to match any one of the listed characters. Use `+` to match one or more times. `@` will match itself.

In [72]:
sentiment_analysis = ['Just got ur newsletter, those fares really are unbelievable. Write to statravelAU@gmail.com or statravelpo@hotmail.com. They have amazing prices',
                      'I should have paid more attention when we covered photoshop in my webpage design class in undergrad. Contact me Hollywoodheat34@msn.net.',
                      'hey missed ya at the meeting. Read your email! msdrama098@hotmail.com']


# Write a regex that matches email
regex_email = r"([A-Za-z0-9]+)@\S+"

for tweet in sentiment_analysis:
    # Find all matches of regex in each tweet
    email_matched = re.findall(regex_email, tweet)

    # Complete the format method to print the results
    print("Lists of users found in this tweet: {}".format(email_matched))

Lists of users found in this tweet: ['statravelAU', 'statravelpo']
Lists of users found in this tweet: ['Hollywoodheat34']
Lists of users found in this tweet: ['msdrama098']
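
A second group can capture the domain as well. The pattern below is a simplified sketch that assumes domains of the form `name.tld`; it is not a complete email grammar.

```python
import re

tweet = ('Write to statravelAU@gmail.com or statravelpo@hotmail.com. '
         'They have amazing prices')

# Two groups: the user name before "@" and a simple "name.tld" domain after it
regex_email = r"([A-Za-z0-9]+)@([A-Za-z0-9]+\.[a-z]+)"

pairs = re.findall(regex_email, tweet)
for user, domain in pairs:
    print(user, domain)
```

Because the domain group stops at the last lowercase letter, the period that ends the sentence is not included in the match.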
