## Text documents

Much of human communication is recorded as text, but text on its own is unstructured data.

Unstructured data is information that has no predefined format or schema. Structured data fits into database tables with designated fields; unstructured data does not. Examples include raw text, emails, social media posts, videos, and images, all of which must be processed and analyzed before they can be interpreted meaningfully.
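
To make the distinction concrete, here is a minimal sketch that holds the same facts once as a structured record and once as free text. The field names, the sample sentence, and the `extract_year` helper are illustrative assumptions, not part of any dataset used in this notebook.

```python
import re

# Structured: every value sits in a named field and can be looked up directly.
structured_record = {"author": "Bruce Mazlish", "year": 2009}

# Unstructured: the same facts are embedded in a sentence.
raw_text = "Bruce Mazlish discussed the idea in his 2009 book."

def extract_year(text):
    """Return the first four-digit year (19xx or 20xx) found in text, or None."""
    match = re.search(r"\b(?:19|20)\d{2}\b", text)
    return int(match.group()) if match else None

print(structured_record["year"])  # direct lookup by field name
print(extract_year(raw_text))     # the text has to be parsed first
```

The regular expression is deliberately naive. Real documents usually need more robust parsing, which is why the text must be extracted and cleaned first.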

## Obtain dataset

You can download a dataset from a public archive such as:

Archive.org: https://archive.org/


## .txt documents

In [2]:
import os

# Build the path to a file stored in a folder under the current working directory
def locate_child_file(child_folder, file_name):
    """Return the path to file_name inside child_folder.

    child_folder is resolved against the current working directory, which for a
    notebook is usually the folder the notebook was started from.
    The file is expected directly inside child_folder, not in a nested folder."""
    working_directory = os.getcwd()
    # os.path.join() joins path components using the separator of the current OS
    return os.path.join(working_directory, child_folder, file_name)

sample_txt = locate_child_file('Txt', 'humanity_paper.txt')

# Read the content of the text file
try:
    with open(sample_txt, 'r', encoding='utf-8') as file:
        file_content = file.read()
        print("File Content:")
        print(file_content)
except FileNotFoundError:
    print(f"The file '{sample_txt}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")




File Content:
Technological Forecasting & Social Change 186 (2023) 122154
Available online 17 November 2022
0040-1625/© 2022 Elsevier Inc. All rights reserved.Examining the role of virtue ethics and big data in enhancing viable, 
sustainable, and digital supply chain performance 
Surajit Baga,b,*, Muhammad Sabbir Rahmanc, Gautam Srivastavad, Adam Shoree, 
Pratibha Ramf 
aCentre for Data Science, Institute of Management Technology, Ghaziabad, India 
bDepartment of Transport and Supply Chain Management, University of Johannesburg, South Africa 
cDepartment of Marketing and International Business, School of Business and Economics, North South University, Dhaka, Bangladesh 
dIILM Graduate School of Management, 16, Knowledge Park II, Greater Noida, Uttar Pradesh 201306, India 
eFaculty of Business and Law, Liverpool Business School, United Kingdom of Great Britain and Northern Ireland 
fDepartment of Materials, Alliance Manchester Business School, United Kingdom of Great Britain and Norther
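
The output above keeps the line breaks and trailing spaces of the original layout. As a minimal sketch of a first cleanup pass, the hypothetical helper `read_clean_lines` below (not part of the original notebook) reads a file, strips surrounding whitespace from each line, and drops empty lines. The demo writes a throwaway file so the block runs on its own.

```python
import tempfile
from pathlib import Path

def read_clean_lines(path, encoding="utf-8"):
    """Return the non-empty lines of a text file, stripped of surrounding whitespace."""
    text = Path(path).read_text(encoding=encoding)
    return [line.strip() for line in text.splitlines() if line.strip()]

# Demo on a temporary file so the sketch does not depend on the notebook's data.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.txt"
    demo_path.write_text("First line  \n\n  Second line\n", encoding="utf-8")
    demo_lines = read_clean_lines(demo_path)

print(demo_lines)  # ['First line', 'Second line']
```

In this notebook the helper could be called as `read_clean_lines(sample_txt)`. It does not repair extraction damage inside a line, such as fused words.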

## .docx documents

In [1]:
!pip install docx2txt

Defaulting to user installation because normal site-packages is not writeable
Collecting docx2txt
  Downloading docx2txt-0.8.tar.gz (2.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: docx2txt
  Building wheel for docx2txt (setup.py) ... done
  Created wheel for docx2txt: filename=docx2txt-0.8-py3-none-any.whl size=3980 sha256=c5f25e16ad2a70425963819d1c6e484d1f103b61d363fee887372dfd6ccc7038
  Stored in directory: /home/kin/.cache/pip/wheels/22/58/cf/093d0a6c3ecfdfc5f6ddd5524043b88e59a9a199cb02352966
Successfully built docx2txt
Installing collected packages: docx2txt
Successfully installed docx2txt-0.8


In [4]:
import docx2txt

# Function to extract text from .docx file
def extract_text_from_docx(docx_file):
    try:
        text = docx2txt.process(docx_file)
        return text
    except Exception as e:
        print("Error:", str(e))
        return None

# Function to extract text from .txt file
def extract_text_from_txt(txt_file):
    try:
        with open(txt_file, 'r', encoding='utf-8') as file:
            text = file.read()
        return text
    except Exception as e:
        print("Error:", str(e))
        return None

# Locate the sample .docx file in the 'Txt' folder
docx_file_path = locate_child_file('Txt', 'sample.docx')

# Extract text from .docx file
docx_text = extract_text_from_docx(docx_file_path)
if docx_text:
    print("Text from .docx file:")
    print(docx_text)



Text from .docx file:
In December 2019 the International Red Cross and Red Crescent Movement will meet in Geneva as it does every four years for our International Conference. Once again, we will gather under a banner promoting the power of humanity. But what is humanity and what is the power behind this slogan?

Humanity means three different things: a species; a behaviour, and a global identity. The historical relationship between these different dimensions of humanity has been elegantly discussed by the late Bruce Mazlish in his 2009 book The Idea of Humanity in a Global Era and it is important to distinguish between these three aspects of being human as we prepare to meet as a global humanitarian movement once again.



Humanity as species

The first meaning of humanity describes a particular kind of animal that biologists encouragingly call homo sapiens – or wise human – and which seems distinct from all other animals because of its powers of language, reasoning, imagination and te
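
A `.docx` file is a ZIP archive whose main text lives in `word/document.xml`. The sketch below reads that part with the standard library alone, to show roughly what a text extractor has to do. It is not how `docx2txt` is implemented, which may differ, and it ignores headers, footers, and footnotes. The helper name `docx_paragraphs` and the demo archive are assumptions made for illustration.

```python
import tempfile
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by the main document part.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(docx_path):
    """Return the text of each <w:p> paragraph found in word/document.xml."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs

# Demo: a stripped-down archive containing only the document part.
# It is enough for this sketch but is not a complete file that Word would open.
DEMO_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>Humanity as </w:t></w:r><w:r><w:t>species</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

with tempfile.TemporaryDirectory() as tmp:
    demo_docx = Path(tmp) / "demo.docx"
    with zipfile.ZipFile(demo_docx, "w") as archive:
        archive.writestr("word/document.xml", DEMO_XML)
    demo_paragraphs = docx_paragraphs(demo_docx)

print(demo_paragraphs)  # ['Humanity as species', 'Second paragraph']
```

For everyday use a dedicated library such as `docx2txt` remains the simpler choice. The sketch only shows why a `.docx` file cannot be read with a plain `open(...).read()` the way a `.txt` file can.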