# Task 1

This assignment deals with identifying and cleaning product brand strings from raw data obtained from the source website.

When data is obtained (scraped) from a source website, the brand string seen on the web page of a particular product may be written in a multitude of different ways. Some brands may be more complicated to identify than others depending on the variety of ways their products are presented by source websites.

In this task, you will be looking for brand strings that are associated with the brand Ralph Lauren in a raw data file called “Distinct_Brands.txt”.

Download the file and take some time to analyze the various ways the Ralph Lauren brand may be scraped directly from the page. Some examples are below:

ralph lauren
RalphLauren
ralph-lauren
Polo Ralph Lauren
Polo-Ralph-Lauren
Polo Raulph
Polo Ralph Lauren Woven Stripe Pajama Pants

There are many other strings in this file identifying products with the brand Ralph Lauren. The above list is just a small subset of those strings to give you an idea.

Your task is to study the raw file and use your best judgement to identify all the strings that are affiliated with the brand Ralph Lauren. Subsequently, write a Python script that reads in the “Distinct_Brands.txt” file, applies an approach to identify all the Ralph Lauren-related strings, and outputs those strings to another file. Please commit both your code and the output file containing the strings you identified to the GitHub repository.


In [1]:
#import Python Packages
import re
import pandas as pd

In [2]:
# Read in the text file. It contains some Latin-1 characters, so set the encoding explicitly.
with open('C:\\Users\\jltsa\\Desktop\\hearful\\Data\\Distinct_Brands.txt', encoding='latin-1', mode='r') as file:
    brands = file.read()

In [3]:
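# List comprehension over the comma-separated entries:
# drop the double quotes, strip surrounding white space,
# then remove the characters \ / | ( ) { } < >, which seems safe.
# Other characters might be part of brand names, so they are kept.
# A raw string avoids the invalid '\(' escape in the pattern.
names = [re.sub(r'[\\(){}<>/|]', '', x.replace('"', '').strip()) for x in brands.split(',')]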
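To make the cleaning step concrete, here is a minimal sketch that applies the same split-and-strip logic to a short inline string. The `sample` text and the `clean` helper are hypothetical, and the comma-separated, double-quoted layout is only an assumption inferred from the cell above.

```python
import re

# Hypothetical raw text, assuming the same comma-separated, double-quoted
# layout that the notebook cell above relies on.
sample = '"ralph lauren", "Polo (Ralph Lauren)" ,"Levi\'s", "<RalphLauren>"'


def clean(raw):
    """Split on commas, drop double quotes, trim white space,
    and remove the characters \\ / | ( ) { } < >."""
    return [
        re.sub(r'[\\(){}<>/|]', '', entry.replace('"', '').strip())
        for entry in raw.split(',')
    ]


cleaned = clean(sample)
print(cleaned)
```

Note that splitting on every comma would also break up a brand string that itself contains a comma, so this approach only holds if the entries are comma-free.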

After some exploration of the file, none of the following misspellings occur:

'ralf'

'raulf'

'lawren'

'rlph'

'ralh'

Misspellings that do occur:

'raulph'

To find strings associated with Ralph Lauren, the regular expression should ignore letter case, so we pass the flag re.IGNORECASE. I also want to catch typos where a letter is accidentally repeated, so every required character in the pattern is followed by `+` (one or more). I looked for possible misspellings by opening the file in a text editor. The only one I saw was 'Raulph', so the letter 'u' is optional in the first word (`u*`). Finally, the pattern needs to allow optional white space and non-alphanumeric characters between Ralph and Lauren.

In [55]:
rl = []
regex = re.compile(r'(r+a+u*l+p+h+)\s*\W*(l+a+u+r+e+n+)', re.IGNORECASE)

for text in names:
    if regex.search(text):
        rl.append(text)
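As a quick check of what this pattern does and does not catch, the sketch below runs it against the example strings listed in the task description. The two strings in `hypothetical` are made up for illustration and are not taken from the file.

```python
import re

# Same pattern as in the cell above.
regex = re.compile(r'(r+a+u*l+p+h+)\s*\W*(l+a+u+r+e+n+)', re.IGNORECASE)

# Example strings copied from the task description.
task_examples = [
    'ralph lauren',
    'RalphLauren',
    'ralph-lauren',
    'Polo Ralph Lauren',
    'Polo-Ralph-Lauren',
    'Polo Raulph',
    'Polo Ralph Lauren Woven Stripe Pajama Pants',
]

# Hypothetical strings (not from the file) that probe repeated letters
# and an underscore separator.
hypothetical = ['Raulph Laurenn', 'Ralph_Lauren']

candidates = task_examples + hypothetical
matched = [s for s in candidates if regex.search(s)]
missed = [s for s in candidates if not regex.search(s)]

print('matched:', matched)
print('missed:', missed)
```

Here 'Polo Raulph' is missed because the pattern requires a Lauren-like word after Ralph, and the underscore variant is missed because `\W` does not match `_`. Whether such strings should count as Ralph Lauren is a judgement call. If they should, the pattern would need to be loosened.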


In [56]:
print('Found: '+ str(len(rl)) + ' strings associated with Ralph Lauren')

Found: 2241 strings associated with Ralph Lauren


In [49]:
# Wrap the list in a pandas DataFrame
rl_ser = pd.DataFrame(rl)

In [52]:
# Save file as .csv
# If needed as a list, can also use pickle
rl_ser.to_csv('rl_strings.csv', index=False)

A Python script named rl_strings.py exists in the Task_1 directory.

To run the script, open a command line and change into the directory containing the rl_strings.py script.

Then type 'python rl_strings.py Distinct_Brands.txt rl_strings.csv' in the command line.

Alternatively, you can run this notebook to create the .csv file.
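For readers who want to see how the notebook steps could fit into a command-line script that takes an input path and an output path, here is a minimal sketch. It is not the contents of rl_strings.py. The function names `find_rl_strings` and `main` are hypothetical, and it writes the output with the standard-library `csv` module instead of pandas, so the file layout may differ slightly from the notebook's output.

```python
import csv
import re
import sys

# Same pattern as in the notebook.
RL_REGEX = re.compile(r'(r+a+u*l+p+h+)\s*\W*(l+a+u+r+e+n+)', re.IGNORECASE)


def find_rl_strings(raw):
    """Clean the comma-separated entries and keep those matching the pattern."""
    names = [
        re.sub(r'[\\(){}<>/|]', '', entry.replace('"', '').strip())
        for entry in raw.split(',')
    ]
    return [name for name in names if RL_REGEX.search(name)]


def main(argv):
    """argv is expected to be [script_name, input_path, output_path]."""
    in_path, out_path = argv[1], argv[2]
    # latin-1 matches the encoding used when reading the file in the notebook.
    with open(in_path, encoding='latin-1', mode='r') as file:
        raw = file.read()
    # Assumption: one matched string per row, written as UTF-8.
    with open(out_path, 'w', encoding='utf-8', newline='') as out:
        writer = csv.writer(out)
        for name in find_rl_strings(raw):
            writer.writerow([name])


# Only run when invoked with exactly two arguments, e.g.
#   python rl_strings.py Distinct_Brands.txt rl_strings.csv
if __name__ == '__main__' and len(sys.argv) == 3:
    main(sys.argv)
```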