Methods for data cleaning

In [1]:
from bs4 import BeautifulSoup

def remove_html_and_script(text):
    soup = BeautifulSoup(text, "html.parser")

    # Remove script and style tags completely
    for tag in soup(["script", "style"]):
        tag.decompose()

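    # strip=True trims each text fragment; the separator keeps words on either
    # side of a removed tag (e.g. <br>) from being glued together
    return soup.get_text(separator=" ", strip=True)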

import unicodedata

def normalize_unicode(text):
    return unicodedata.normalize('NFKC', text)
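
As a rough illustration of what NFKC normalization changes and what it leaves alone, the sketch below uses only the standard library. The sample strings are illustrative assumptions, not records from the dataset.

```python
import unicodedata

def normalize_unicode(text):
    # Same call as in the cell above: NFKC applies compatibility
    # decomposition followed by canonical composition.
    return unicodedata.normalize("NFKC", text)

# Compatibility characters are folded into their plain equivalents.
print(repr(normalize_unicode("\ufb01lter")))       # "fi" ligature becomes the letters f and i
print(repr(normalize_unicode("30\u00a0Million")))  # no-break space becomes a regular space
print(repr(normalize_unicode("ebm\u2011papst")))   # non-breaking hyphen U+2011 becomes hyphen U+2010

# Characters without a compatibility mapping pass through unchanged.
print(normalize_unicode("\u20ac30") == "\u20ac30")      # euro sign
print(normalize_unicode("Let\u2019s") == "Let\u2019s")  # right single quotation mark
```

Typographic characters such as the euro sign, curly quotes and the en dash are therefore expected to survive this step; it does not convert text to ASCII.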

Testing data cleaning methods

In [4]:
# text to be cleaned
text1 = "ebm\u2011papst Invests \u20ac30 Million in New Site in Romania"
text2 = "Let\u2019s decarbonize district heating<br><br> together at this year\u2019s Euroheat &amp; Power Congress in Prague, Czech Republic, where"
text3 = 'Mark your calendars for May 27, 2025, at 11:00 CET, as we\\u2019re hosting an online session; Retrofitting Commercial Buildings: How E'
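# Note: text3 contains a literal backslash escape (\\u2019) rather than the decoded character.
# The clean_text2 method works with text3.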



In [37]:
print(text1)

ebm‑papst Invests €30 Million in New Site in Romania
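
The HTML step can be checked against text2 in the same way. When `get_text` is called with `strip=True`, each text fragment is stripped and the fragments are joined with the `separator` argument, which defaults to an empty string. The sketch below compares both settings; `strip_html` is a hypothetical helper written for this comparison, and text2 is shortened.

```python
from bs4 import BeautifulSoup

def strip_html(text, separator=""):
    # Hypothetical helper for this sketch: same steps as
    # remove_html_and_script, but with get_text's separator exposed.
    soup = BeautifulSoup(text, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=separator, strip=True)

text2 = ("Let\u2019s decarbonize district heating<br><br> together at this "
         "year\u2019s Euroheat &amp; Power Congress")

glued = strip_html(text2)                  # fragments joined with no separator
spaced = strip_html(text2, separator=" ")  # fragments joined with a space

print(glued)
print(spaced)
```

In both cases the `&amp;` entity is decoded to `&`; only the spaced variant keeps "heating" and "together" as separate words.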


In [84]:
# Create a DataFrame with one column and one row
data = {'Title': ['More Displacement, Same Iconic Series \u2013 06.05.2025 Secop SCE Plus Video \u2013 Behind the Scenes']}
df = pd.DataFrame(data)

pd.set_option('display.max_colwidth', None)  # Set to None to display the full content of the column

df['Title'][0] = remove_html_and_script(df['Title'][0])
df['Title'][0] = normalize_unicode(df['Title'][0])

# Display the DataFrame
df

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df['Title'][0] = remove_html_and_script(df['Title'][0])

Unnamed: 0,Title
0,"More Displacement, Same Iconic Series – 06.05.2025 Secop SCE Plus Video – Behind the Scenes"
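
The warning above recommends a single-step assignment through `.loc`. A minimal sketch of that pattern, assuming a one-row frame with an illustrative title (not taken from the dataset) and using only the Unicode step to keep it short:

```python
import unicodedata
import pandas as pd

# Illustrative one-row frame; the title text is an assumption for this sketch.
df = pd.DataFrame({"Title": ["Secop\u00a0SCE Plus Video \u2013 Behind the Scenes"]})

# Row label and column name go into one .loc call, so the write targets
# df itself rather than an intermediate object.
df.loc[0, "Title"] = unicodedata.normalize("NFKC", df.loc[0, "Title"])

print(df.loc[0, "Title"])
```

The same form applies to any cleaning function that takes and returns a string.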


In [79]:
output

'More Displacement, Same Iconic Series – 06.05.2025 Secop SCE Plus Video – Behind the Scenes'

Testing data cleaning on CSV file records

In [1]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("../SP_Test.csv")

# Display the first few rows of the DataFrame
df

Unnamed: 0,Title,Summary,URL
0,Our skilled trades colleagues are an essential...,Our skilled trades colleagues are an essential...,https://www.linkedin.com/feed/update/urn:li:ac...
1,Gefahrgut automatisiert im KV-Terminal abwickeln,"Projektpartner LKZ Prienn, Concroo, Duss und K...",https://www.eurotransport.de/logistik/speditio...
2,"Mark your calendars for May 27, 2025, at 11:00...","Mark your calendars for May 27, 2025, at 11:00...",https://www.linkedin.com/feed/update/urn:li:ac...
3,Refrigerant Innovations Are The Focus at Oklah...,The school is partnering with the HVAC indust...,https://www.achrnews.com/articles/164551-refri...
4,ATMO Australia: R290 Commercial HVAC Case Stud...,"ATMOsphere COO and Head of APAC, Jan Dusek, s...",https://naturalrefrigerants.com/atmo-australia...
5,Hupac beklagt Bahnprobleme : Negativer Trend i...,Der Schweizer Kombi-Operateur Hupac beklagt di...,https://www.eurotransport.de/logistik/speditio...
6,"In the high-temperature #HeatPump sector, the ...","In the high-temperature #HeatPump sector, the ...",https://www.linkedin.com/feed/update/urn:li:ac...
7,"More Displacement, Same Iconic Series \u2013 0...",Secop\u2019s new SCE Plus compressor range ex...,https://www.secop.com/updates/news/secop-sce-p...
8,J &amp; E Hall are pleased to announce Grahame...,J &amp; E Hall are pleased to announce Grahame...,https://www.linkedin.com/feed/update/urn:li:ac...
9,Streit um den Baufortschritt bei Brücken,"""In wesentlichen Punkten irreführend und besch...",https://www.eurotransport.de/logistik/verkehrs...


In [70]:
# Work on a copy so the cleaning steps below do not also modify df
test = df.copy()

test['Title'] = test['Title'].apply(remove_html_and_script)
test['Title'] = test['Title'].apply(normalize_unicode)

test['Summary'] = test['Summary'].apply(remove_html_and_script)
test['Summary'] = test['Summary'].apply(normalize_unicode)
row = test.iloc[19]
row['Title']

'ebm\\u2011papst Invests \\u20ac30 Million in New Site in Romania'

In [72]:
row['Title'] = remove_html_and_script(row['Title'])
row['Title'] = normalize_unicode(row['Title'])
row['Title']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['Title'] = remove_html_and_script(row['Title'])
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row['Title'] = normalize_unicode(row['Title'])


'ebm\\u2011papst Invests \\u20ac30 Million in New Site in Romania'
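
The doubled backslashes in the output above suggest that the CSV stores `\u2011` and `\u20ac` as literal text (a backslash, the letter u and four hex digits) rather than as the characters themselves, which would explain why neither cleaning function changes the title. One possible way to decode such sequences is sketched below; `decode_literal_unicode_escapes` is a hypothetical helper, not part of the notebook.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_ESCAPE_RE = re.compile(r"\\u([0-9a-fA-F]{4})")

def decode_literal_unicode_escapes(text):
    # Hypothetical helper: replace each literal \uXXXX sequence with the
    # character it names. Surrogate pairs are not recombined in this sketch.
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)

raw_title = 'ebm\\u2011papst Invests \\u20ac30 Million in New Site in Romania'
decoded = decode_literal_unicode_escapes(raw_title)

print(decoded)
```

After this step the string matches text1 from the earlier test cell, and `normalize_unicode` can then act on the real characters.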

In [73]:
# Iterate through each record in the DataFrame and apply functions in place
for index, row in test.iterrows():
    # Print original values
    original_title = row['Title']
    original_summary = row['Summary']
    print(f"Original Title (Row {index}): {original_title}")
    print(f"Original Summary (Row {index}): {original_summary}")

    # Apply the functions in place and update the DataFrame
    test.at[index, 'Title'] = normalize_unicode(remove_html_and_script(original_title))
    test.at[index, 'Summary'] = normalize_unicode(remove_html_and_script(original_summary))

    # Print cleaned values
    print(f"Cleaned Title (Row {index}): {test.at[index, 'Title']}")
    print(f"Cleaned Summary (Row {index}): {test.at[index, 'Summary']}")
    print("\n" + "-"*50)  # Separator for readability


Original Title (Row 0): Our skilled trades colleagues are an essential part of not only our company but also society. These professionals bring the buil
Original Summary (Row 0): Our skilled trades colleagues are an essential part of not only our company but also society. These professionals bring the buildings where we live, work and play to life. They show up, solve problems and keep buildings -and our business- moving forward. Thank you for your dedication to our customers!#SkilledTradesDay #NationalSkilledTradesDay
Cleaned Title (Row 0): Our skilled trades colleagues are an essential part of not only our company but also society. These professionals bring the buil
Cleaned Summary (Row 0): Our skilled trades colleagues are an essential part of not only our company but also society. These professionals bring the buildings where we live, work and play to life. They show up, solve problems and keep buildings -and our business- moving forward. Thank you for your dedication to our custom

In [63]:
# Apply cleaning functions in place on both 'Title' and 'Summary' columns
test['Title'] = test['Title'].apply(lambda x: normalize_unicode(remove_html_and_script(str(x))))
test['Summary'] = test['Summary'].apply(lambda x: normalize_unicode(remove_html_and_script(str(x))))

# Print the cleaned DataFrame
test


Unnamed: 0,Title,Summary,URL
0,Our skilled trades colleagues are an essential...,Our skilled trades colleagues are an essential...,https://www.linkedin.com/feed/update/urn:li:ac...
1,Gefahrgut automatisiert im KV-Terminal abwickeln,"Projektpartner LKZ Prienn, Concroo, Duss und K...",https://www.eurotransport.de/logistik/speditio...
2,"Mark your calendars for May 27, 2025, at 11:00...","Mark your calendars for May 27, 2025, at 11:00...",https://www.linkedin.com/feed/update/urn:li:ac...
3,Refrigerant Innovations Are The Focus at Oklah...,The school is partnering with the HVAC industr...,https://www.achrnews.com/articles/164551-refri...
4,ATMO Australia: R290 Commercial HVAC Case Stud...,"ATMOsphere COO and Head of APAC, Jan Dusek, sp...",https://naturalrefrigerants.com/atmo-australia...
5,Hupac beklagt Bahnprobleme : Negativer Trend i...,Der Schweizer Kombi-Operateur Hupac beklagt di...,https://www.eurotransport.de/logistik/speditio...
6,"In the high-temperature #HeatPump sector, the ...","In the high-temperature #HeatPump sector, the ...",https://www.linkedin.com/feed/update/urn:li:ac...
7,"More Displacement, Same Iconic Series \u2013 0...",Secop\u2019s new SCE Plus compressor range exp...,https://www.secop.com/updates/news/secop-sce-p...
8,J & E Hall are pleased to announce Grahame Kee...,J & E Hall are pleased to announce Grahame Kee...,https://www.linkedin.com/feed/update/urn:li:ac...
9,Streit um den Baufortschritt bei Brücken,"""In wesentlichen Punkten irreführend und besc...",https://www.eurotransport.de/logistik/verkehrs...


In [77]:
import pandas as pd
from bs4 import BeautifulSoup
import unicodedata

# Method to remove HTML and script tags from text
def remove_html_and_script(text):
    soup = BeautifulSoup(text, "html.parser")

    # Remove script and style tags completely
    for tag in soup(["script", "style"]):
        tag.decompose()

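    # strip=True trims each text fragment; the separator keeps words on either
    # side of a removed tag (e.g. <br>) from being glued together
    return soup.get_text(separator=" ", strip=True)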

# Method to normalize unicode characters
def normalize_unicode(text):
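    # NFKC, matching the definition at the top of the notebook; NFKD would leave
    # characters such as "ü" decomposed into a base letter plus a combining mark
    return unicodedata.normalize('NFKC', text)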

# Method to clean a DataFrame column by applying both functions
def clean_column(df, column_name):
    # Apply both remove_html_and_script and normalize_unicode in sequence
    df[column_name] = df[column_name].apply(lambda x: normalize_unicode(remove_html_and_script(x)))

    return df

# Example usage
# Assuming 'df' is your DataFrame and 'Title' is the column to clean
# df = pd.read_csv("your_file.csv")  # Uncomment this to load the CSV

# Clean the 'Title' column in place
test = clean_column(test, 'Title')

# Clean the 'Summary' column in place
test = clean_column(test, 'Summary')

# Show cleaned data
print(test[['Title', 'Summary']])

print(test['Title'].iloc[7])

                                                Title  \
0   Our skilled trades colleagues are an essential...   
1    Gefahrgut automatisiert im KV-Terminal abwickeln   
2   Mark your calendars for May 27, 2025, at 11:00...   
3   Refrigerant Innovations Are The Focus at Oklah...   
4   ATMO Australia: R290 Commercial HVAC Case Stud...   
5   Hupac beklagt Bahnprobleme : Negativer Trend i...   
6   In the high-temperature #HeatPump sector, the ...   
7   More Displacement, Same Iconic Series \u2013 0...   
8   J & E Hall are pleased to announce Grahame Kee...   
9           Streit um den Baufortschritt bei Brücken   
10  IIR Highlights Natural Refrigerants as Sustain...   
11         Nagel-Group baut Logistikzentrum in Danzig   
12  AHT Launches SPI CIRCUMPOLAR Modular Pump Stat...   
13  In the quest for sustainable alternatives to g...   
14  Meet our colleague Janusz Kieruzel. Janusz beg...   
15  Alfa Laval will unveil its large capacity oliv...   
16    2025 Eurovent Summit reve
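
`clean_column` passes every cell to the cleaning functions, so a missing value (a float NaN after `read_csv`) may raise an error, while the earlier `str(x)` wrapper turns it into the text 'nan'. A possible variant that leaves non-string cells alone is sketched below. `clean_column_safe` and `clean_text` are hypothetical names, and the cleaner is reduced to the Unicode step to keep the sketch free of other dependencies.

```python
import unicodedata
import pandas as pd

def clean_text(value):
    # Stand-in cleaner for this sketch (Unicode normalization only).
    return unicodedata.normalize("NFKC", value)

def clean_column_safe(df, column_name):
    # Hypothetical variant of clean_column: only str cells are cleaned,
    # so NaN and other non-string values are returned unchanged.
    df[column_name] = df[column_name].map(
        lambda x: clean_text(x) if isinstance(x, str) else x
    )
    return df

# Illustrative data with two missing values.
sample = pd.DataFrame({"Title": ["30\u00a0Million", None, float("nan")]})
sample = clean_column_safe(sample, "Title")

print(sample)
```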

In [2]:
# Write a unicode-escaped copy of df to a new CSV file

test1 = df.copy()

#test1['Title'] = test1.applymap(lambda x: normalize_unicode(remove_html_and_script(x)) if isinstance(x, str) else (print(f"Non-string value: {x}") or x))
# encode('unicode-escape') returns bytes; decode back to str so the CSV holds
# plain escaped text rather than b'...' literals
test1['Title'] = test1['Title'].map(lambda x: x.encode('unicode-escape').decode('ascii'))
test1['Summary'] = test1['Summary'].map(lambda x: x.encode('unicode-escape').decode('ascii'))

# Write to CSV
test1.to_csv("../writes_test.csv", index=False, encoding='utf-8')
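
`str.encode('unicode-escape')` returns a `bytes` object, so a decode step is needed to get text back before writing. The sketch below shows the round trip on a single string using only the standard library; `to_ascii_escapes` and `from_ascii_escapes` are hypothetical helper names.

```python
def to_ascii_escapes(text):
    # encode() returns bytes; decoding as ASCII yields a str again,
    # with non-ASCII characters written as backslash escapes.
    return text.encode("unicode_escape").decode("ascii")

def from_ascii_escapes(text):
    # Inverse step for strings produced by to_ascii_escapes.
    return text.encode("ascii").decode("unicode_escape")

title = "ebm\u2011papst Invests \u20ac30 Million in New Site in Romania"
escaped = to_ascii_escapes(title)

print(escaped)
print(from_ascii_escapes(escaped) == title)
```

Note that this codec also escapes backslashes and newlines, so it is best applied only to text that should end up fully escaped.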