### Cleaning text using Python functions

In this exercise we will open data that was scraped from `.msg` files, an email format that Microsoft Outlook can read and export. Python libraries exist for reading this format, but the extracted text often carries artifacts, such as invisible Unicode characters and stray line breaks, that need to be cleaned up.

This notebook is an example of how to use Python to clean this data. The notebook performs the following actions:
- loads data scraped from `.msg` files
- detects patterns that need to be cleaned
- cleans the text in one column

This is a visual rendering of a `.msg` file as seen in an email client, such as Outlook:

<img src="../ring-alert-sample.png" alt="data sample" width="400px" align="left"/>

In [1]:
import pandas as pd

Load the data

In [2]:
scraped_data = pd.read_csv("../data/brookhavenmsg_extracts_2.csv")

scraped_data.head()

Unnamed: 0,subject,date,sender,to,cc,body_full,file_name
0,A Resident Posted a Crime Incident,"Wed, 17 Nov 2021 18:16:36 -0500",Ring Team <no-reply@neighborhoods.ring.com>,andrea.serrano@brookhavenga.gov,,Post Titled: Stolen Package at Berkshire at Le...,../data/neighbors_data/brookhaven/A Resident P...
1,A Resident Posted a Crime Incident,"Mon, 17 May 2021 08:38:51 -0400",Ring Team <no-reply@neighborhoods.ring.com>,travis.lewis@brookhavenga.gov,,Post Titled: Car\r\n ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌ ‌...,../data/neighbors_data/brookhaven/A Resident P...
2,A Resident Posted a Crime Incident,"Thu, 20 May 2021 23:47:46 -0400","""Ring Team"" <no-reply@neighborhoods.ring.com>",andrea.serrano@brookhavenga.gov,,Post Titled: One or two people checking for un...,../data/neighbors_data/brookhaven/A Resident P...
3,A Resident Posted a Crime Incident,"Sat, 09 Oct 2021 07:09:43 -0400",Ring Team <no-reply@neighborhoods.ring.com>,robert.orange@brookhavenga.gov,,Post Titled: Parked Cars destroyed at Briarhil...,../data/neighbors_data/brookhaven/A Resident P...
4,A Resident Posted a Crime Incident,"Thu, 10 Jun 2021 07:49:36 -0400","""Ring Team"" <no-reply@neighborhoods.ring.com>",travis.lewis@brookhavenga.gov,,Post Titled: Checking cars again in Peachtree ...,../data/neighbors_data/brookhaven/A Resident P...


### Look at the data and inspect it for issues
- set the pandas display width so that column contents are fully legible
- look at one or two examples

In [3]:
pd.set_option('display.max_colwidth', None)

In [4]:
scraped_data["body_full"].iloc[0]

'Post Titled: Stolen Package at Berkshire at Lenox Park\r\n \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c   \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c   \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c   \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c   \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c   \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c   \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c   \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c   \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c   \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c \u200c   \u200c \u2
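The output above is dominated by `\u200c` characters. As a rough way to confirm which non-ASCII characters occur and how often, the following sketch counts them in a single string. The helper name `non_ascii_counts` and the shortened `sample` string are illustrative assumptions, not part of the original notebook; the same function could be applied to any value of `body_full`.

```python
import unicodedata
from collections import Counter


def non_ascii_counts(text):
    """Count each non-ASCII character in text, keyed by code point and name."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}": n
        for ch, n in counts.items()
    }


# Shortened, made-up sample in the style of the scraped email bodies
sample = "Post Titled: Car\r\n \u200c \u200c \u200c"
report = non_ascii_counts(sample)
print(report)
```

Here `\u200c` is reported as a zero-width non-joiner, an invisible character that adds nothing to the readable text.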

#### Remedy identified issues

In the next lines we will use the following methods:
- `.encode()` and `.decode()`: `.encode('ascii', 'ignore')` converts the text to ASCII bytes and drops every character that has no ASCII equivalent, such as `\u200c`; `.decode("utf-8")` then turns the bytes back into a string (you can read more about Unicode [here](https://docs.python.org/3/howto/unicode.html))
- `.replace()`: replace characters like `"\n"` with an empty string `""`
- `.split()`: cut text into parts at a given separator
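Before applying these methods to a whole column, it can help to see them on one short string. This is a minimal sketch: the `sample` text and the `example.com` URL are made up for illustration and only imitate the shape of the scraped email bodies.

```python
# Made-up sample imitating a scraped email body (placeholder URL)
sample = (
    "Post Titled: Car\r\n \u200c \u200c\r\n \t "
    "Neighbors Public Safety Service <https://example.com>"
)

# 1. Encode to ASCII, ignoring characters that cannot be represented,
#    then decode the bytes back into a regular string
no_unicode = sample.encode("ascii", "ignore").decode("utf-8")

# 2. Remove carriage returns, newlines and tabs
replaced_chars = (
    no_unicode.replace("\r", "").replace("\n", "").replace("\t", "")
)

# 3. Keep only the text before the separator phrase
title = replaced_chars.split("Neighbors Public Safety Service")[0]

print(repr(title))
```

The result still ends in spaces, because `.replace()` here only removed the control characters and left the ordinary spaces in place.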

The first step is to use encoding to strip the non-ASCII characters. Note that this also drops any legitimate non-ASCII characters, such as accented letters:

In [5]:
scraped_data["no_unicode"] = scraped_data["body_full"].apply(lambda x: x.encode('ascii', 'ignore').decode("utf-8"))

scraped_data["no_unicode"].iloc[1]

'Post Titled: Car\r\n                                                                                                                                                                     \r\n \t \r\n \t Neighbors Public Safety Service <https://links.neighborhoods.ring.com/ls/click?upn=FHVCVoLBYI7Dvf39yZ-2F5txav887QW1brgG-2F-2BJ99vpUd9zicH1H3TQWs2jOlo2pRKZJG6_CxgEJZQrbN6Mz4P-2BglxdfridtC4-2BxiqaHpotgapJIlmlAH4dOvpEfarcqmmvUrphkI5s7ym30nwn-2FIU0RjSLqKWPtbv6zf-2FGOW5fBHaUUmZA3lHiRhVNCqE7hooqZIo-2Bngdz1cA-2FhM9LXKP5w29NgHYsdlH4dBuybdGgBVF23PHzqkqFcv5OGC4C510lekO-2B2KkLPTYUjMYIQRxQKEhPDyMxwOCZA2jkViLzoAHNz2-2BdvgLneL5sRbv2lUVKpmVuPytQfIudqFKBex8AJarUhvXv3d7S1tUz-2BrF51C1dvVcVWSiPfxK2ATIcC6juI-2Ft45YlockF509-2B8s6TjyMb0QcISqiObTgJkb6B9-2FHTsacVTAQUvKQ9vrY-2Bz62j4Zy3Gs2W1N7HEE3ZPCXi-2BjfL6MvFhOQ-2BgB0jMS75bD4EOBrhBtPqQu0PAtsKlPAbSPfTpJaSTVNV1TdOZNgN-2BWESREFck2sqrCtwvujhYFv9FvR-2BOWjHmf20jvSiw1eM4c3nco1cFUPzhEGs8zZlku3XUFwL5sRfFTz-2FLoMkdeiLGXjk69ieOs5XR7-2FKUn-2B5CKu2lO8OeIywHGLqpHxMCAuD5VgZc

In [6]:
scraped_data["replaced_chars"] = scraped_data["no_unicode"].apply(
    lambda x: x.replace("\r", "").replace("\n", "").replace("\t", "")
)

scraped_data["replaced_chars"].iloc[1]

'Post Titled: Car                                                                                                                                                                         Neighbors Public Safety Service <https://links.neighborhoods.ring.com/ls/click?upn=FHVCVoLBYI7Dvf39yZ-2F5txav887QW1brgG-2F-2BJ99vpUd9zicH1H3TQWs2jOlo2pRKZJG6_CxgEJZQrbN6Mz4P-2BglxdfridtC4-2BxiqaHpotgapJIlmlAH4dOvpEfarcqmmvUrphkI5s7ym30nwn-2FIU0RjSLqKWPtbv6zf-2FGOW5fBHaUUmZA3lHiRhVNCqE7hooqZIo-2Bngdz1cA-2FhM9LXKP5w29NgHYsdlH4dBuybdGgBVF23PHzqkqFcv5OGC4C510lekO-2B2KkLPTYUjMYIQRxQKEhPDyMxwOCZA2jkViLzoAHNz2-2BdvgLneL5sRbv2lUVKpmVuPytQfIudqFKBex8AJarUhvXv3d7S1tUz-2BrF51C1dvVcVWSiPfxK2ATIcC6juI-2Ft45YlockF509-2B8s6TjyMb0QcISqiObTgJkb6B9-2FHTsacVTAQUvKQ9vrY-2Bz62j4Zy3Gs2W1N7HEE3ZPCXi-2BjfL6MvFhOQ-2BgB0jMS75bD4EOBrhBtPqQu0PAtsKlPAbSPfTpJaSTVNV1TdOZNgN-2BWESREFck2sqrCtwvujhYFv9FvR-2BOWjHmf20jvSiw1eM4c3nco1cFUPzhEGs8zZlku3XUFwL5sRfFTz-2FLoMkdeiLGXjk69ieOs5XR7-2FKUn-2B5CKu2lO8OeIywHGLqpHxMCAuD5VgZchQYkUcDMMIwSEKWe

In [7]:
scraped_data["title"] = scraped_data["replaced_chars"].apply(lambda x: x.split("Neighbors Public Safety Service")[0]) 


In [8]:
scraped_data["title"].head()

0                                                         Post Titled: Stolen Package at Berkshire at Lenox Park                                                                                                                                                                         
1                                                                                               Post Titled: Car                                                                                                                                                                         
2    Post Titled: One or two people checking for unlocked car doors around 2am today. Clairmont near Buford Hwy.                                                                                                                                                                         
3                                                                Post Titled: Parked Cars destroyed at Briarhill                                          
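The titles above still carry the `Post Titled:` label and long runs of trailing spaces. One possible follow-up, which the original notebook does not perform, is to remove the label and collapse the whitespace. The helper name `tidy_title` is hypothetical; it could be passed to `.apply()` on the `title` column in the same way as the lambdas above.

```python
import re


def tidy_title(raw):
    """Drop the 'Post Titled:' label and collapse runs of whitespace."""
    without_label = re.sub(r"^\s*Post Titled:\s*", "", raw)
    return re.sub(r"\s+", " ", without_label).strip()


# Values in the style of the output above
examples = [
    "Post Titled: Car                    ",
    "Post Titled: Stolen Package at Berkshire at Lenox Park        ",
]
cleaned = [tidy_title(value) for value in examples]
print(cleaned)
```

Whether to drop the label is a judgment call: keeping it does no harm, but removing it leaves a column that holds only the post title.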

#### Reduce and export the data

In [9]:
scraped_data.columns

Index(['subject', 'date', 'sender', 'to', 'cc', 'body_full', 'file_name',
       'no_unicode', 'replaced_chars', 'title'],
      dtype='object')

In [10]:
columns = ['subject', 'date', 'sender', 'to', 'cc', 'title']

scraped_data[columns].to_csv("../output/scraped_data.csv")