# Demo Notebook - Text Data Manipulation

### Description
This notebook provides examples of common data manipulations in Python (using pandas) that can be used to automate data processing tasks. It focuses on two of the most common file formats for tabular data: Excel and CSV. The following topics are covered:

1. Reading multiple Excel files from the same directory
2. Removing an Excel header
3. Combining multiple Excel sheets vertically (concatenate)
4. Removing duplicates
5. Removing missing values
6. Expanding one row with a delimited field into multiple rows
7. Reading a CSV file
8. Merging Excel data and CSV data horizontally (join)
9. Outputting to a new Excel or CSV file


In [42]:
# First we must import the pandas package that we will use to read our files
import pandas as pd

## 1. Reading multiple Excel files from the same directory

In [43]:
# Import the glob package
import glob

# Define the directory we want to look at
directory = '1_Data/Excel_demo'
filenames = glob.glob(directory + "/*")

# Keep only the paths that contain the string "excel_demo" (the check is case-sensitive)
excel_filenames = [f for f in filenames if "excel_demo" in f]
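
The filtering step above can also be sketched with `pathlib`, which makes it easy to test the file name rather than the whole path. This is only an illustration: the helper name `find_excel_files`, the `.xlsx` pattern, and the exclusion of `~$` lock files (which Excel creates while a workbook is open) are assumptions, not part of the original notebook.

```python
import tempfile
from pathlib import Path

def find_excel_files(directory, name_fragment="excel_demo"):
    """Return sorted .xlsx paths in `directory` whose file name contains `name_fragment`."""
    return sorted(
        str(p) for p in Path(directory).glob("*.xlsx")
        # Skip Excel's temporary lock files, whose names start with "~$"
        if name_fragment in p.name and not p.name.startswith("~$")
    )

# Demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["excel_demo_2.xlsx", "excel_demo_1.xlsx", "map.csv", "~$excel_demo_1.xlsx"]:
        (Path(tmp) / name).touch()
    found = [Path(f).name for f in find_excel_files(tmp)]

print(found)
```

Sorting the result gives a stable file order, since `glob` does not guarantee any particular ordering.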

## 2 + 3. Removing an Excel header and combining multiple Excel sheets vertically

In [44]:
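# Read the "export_query_results" sheet from each file, then stack the
# resulting dataframes vertically with pd.concat.
# skiprows = 5 skips the first 5 rows of each sheet (the header block above the table).
df_excel = pd.concat((pd.read_excel(f,
                                    sheet_name = "export_query_results",
                                    skiprows = 5) for f in excel_filenames))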

# If you only want to read one file you can use this instead:
# pd.read_excel("1_Data/Excel_demo/excel_demo_1.xlsx", sheet_name = "export_query_results", skiprows = 5)
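
To see what `skiprows` and `pd.concat` do without needing real workbooks, here is a small sketch that uses `pd.read_csv` on in-memory text (it accepts the same `skiprows` argument). The file contents are invented for the example. It also shows `ignore_index=True`, an optional argument that renumbers the combined rows instead of repeating each file's own row numbers.

```python
import io
import pandas as pd

# Two made-up exports, each with 2 lines of banner text above the real header row
raw_a = "Report A\nGenerated today\nPR ID,Type\n1,x\n2,y\n"
raw_b = "Report B\nGenerated today\nPR ID,Type\n3,z\n"

# skiprows=2 drops the banner lines so "PR ID,Type" is read as the column header
frames = [pd.read_csv(io.StringIO(raw), skiprows=2) for raw in (raw_a, raw_b)]

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
print(combined)
```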

## 4. Remove duplicates

In [45]:
# Drop rows that exactly duplicate an earlier row (all columns equal), keeping the first occurrence
df_excel = df_excel.drop_duplicates(keep = "first")
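
By default `drop_duplicates` compares every column. If duplicates should instead be judged on a key column only, the `subset` argument can be used. A minimal sketch with made-up data (the column values are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "PR ID": [1, 1, 2, 2],
    "Type": ["a", "a", "b", "c"],
})

# Default: a row is a duplicate only if every column matches an earlier row
full_dedup = df.drop_duplicates(keep="first")

# subset: compare only the listed columns, so the second "PR ID" == 2 row is dropped too
key_dedup = df.drop_duplicates(subset=["PR ID"], keep="first")

print(full_dedup)
print(key_dedup)
```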

## 5. Remove missing values

In [46]:
# Drop every row that has a missing value in any column
df_excel = df_excel.dropna()
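
Because `dropna()` with no arguments removes a row if any column is missing, it can discard more data than intended. The sketch below, on invented data, counts missing values per column first and then restricts the check to selected columns with `subset`. Which columns are required is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "PR ID": [1, 2, None, 4],
    "Type": ["a", None, "c", "d"],
    "Notes": [None, "n2", "n3", "n4"],
})

# Count missing values per column before deciding what to drop
missing_counts = df.isna().sum().to_dict()

# Strict: drop a row if any column is missing
strict = df.dropna()

# Targeted: drop a row only if one of the required columns is missing
keyed = df.dropna(subset=["PR ID", "Type"])

print(missing_counts)
print(strict)
print(keyed)
```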

## 6. Expand a row into multiple rows based on delimited column
We will expand the ";"-delimited column [Type] into multiple rows, one row per value.

In [47]:
# First convert [Type] into list form
df_excel['Type'] = df_excel['Type'].str.split(';')

In [49]:
# Then explode the column: each list element gets its own row and the other columns are repeated
df_excel = df_excel.explode('Type')
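
The two steps above can be tried on a tiny made-up dataframe. This sketch also strips whitespace around each value and resets the index, since `explode` repeats the original row label for every new row. Both additions are optional and are not in the original cells.

```python
import pandas as pd

df = pd.DataFrame({"PR ID": [1, 2], "Type": ["a; b", "c"]})

exploded = (
    df.assign(Type=df["Type"].str.split(";"))  # "a; b" becomes ["a", " b"]
      .explode("Type")                         # one row per list element
      .reset_index(drop=True)                  # replace the repeated labels with 0..n-1
)

# Remove the leading space left over from "a; b"
exploded["Type"] = exploded["Type"].str.strip()
print(exploded)
```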

## 7. Read a single CSV file

In [51]:
df_csv = pd.read_csv("1_Data/Excel_demo/map.csv")
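
One thing to watch when reading a CSV that will later be joined on an ID column: pandas infers column types, so IDs such as `001` are read as the integer `1` unless told otherwise. A small sketch with invented data (the `Owner` column and the values are hypothetical) shows the `dtype` argument keeping the key as text:

```python
import io
import pandas as pd

raw = "PR,Owner\n001,ann\n002,bob\n"

# Default type inference turns "001" into the integer 1
default = pd.read_csv(io.StringIO(raw))

# dtype={"PR": str} keeps the values exactly as written in the file
as_text = pd.read_csv(io.StringIO(raw), dtype={"PR": str})

print(default["PR"].tolist())
print(as_text["PR"].tolist())
```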

## 8. Merging Excel data and CSV data horizontally (join)

In [52]:
# Left join: keep every row of df_excel and add the df_csv columns where [PR ID] equals [PR]
df_merge = pd.merge(df_excel,
                    df_csv,
                    left_on = 'PR ID',
                    right_on = 'PR',
                    how = 'left')
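
After a left join it is often useful to check which rows found no match. The sketch below uses two optional `pd.merge` arguments on made-up data: `indicator=True` adds a `_merge` column, and `validate="many_to_one"` raises an error if the right-hand key is not unique. The `Owner` column and the values are invented, and assuming a many-to-one relationship is a guess about the mapping file.

```python
import pandas as pd

left = pd.DataFrame({"PR ID": [1, 2, 3], "Type": ["a", "b", "c"]})
right = pd.DataFrame({"PR": [1, 2], "Owner": ["ann", "bob"]})

merged = pd.merge(
    left,
    right,
    left_on="PR ID",
    right_on="PR",
    how="left",
    indicator=True,          # adds "_merge": "both" or "left_only" for a left join
    validate="many_to_one",  # raises MergeError if "PR" repeats in `right`
)

# Rows from `left` that had no matching "PR" in `right`
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```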

## 9. Outputting to a new Excel or CSV file

In [54]:
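# Output to Excel
# (by default the dataframe index is written as the first column; pass index = False to omit it)
df_merge.to_excel("1_Data/Excel_demo/excel_output.xlsx")

# Output to CSV (the same index behavior applies)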
df_merge.to_csv("1_Data/Excel_demo/csv_output.csv")
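
The effect of the `index` argument can be seen by writing to a string instead of a file. This is a sketch on made-up data. The repeated index labels mimic what `explode` leaves behind.

```python
import io
import pandas as pd

df = pd.DataFrame({"PR ID": [1, 1, 2], "Type": ["a", "b", "c"]}, index=[0, 0, 1])

# With no path, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the index-free text back gives the original two columns
round_trip = pd.read_csv(io.StringIO(without_index))

print(with_index.splitlines()[0])     # header row including the unnamed index column
print(without_index.splitlines()[0])  # header row with data columns only
```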