# Doing Text Analysis with Azure Cognitive Services Text Analytics

## Overview
This notebook goes through the basics of writing a custom Cognitive Services Text Analytics application for analyzing long, open-form text. 
Specifically, we'll get key phrases and common entities (e.g. organizations) from each text input. 

Documentation for how to use Cognitive Services Text Analytics with Python can be found [here (SDK V3)](https://docs.microsoft.com/en-us/python/api/azure-ai-textanalytics/azure.ai.textanalytics.textanalyticsclient?view=azure-python).

### How to use this notebook
This notebook contains all instructions and code needed to run text analytics on multiple text files. **For the code to function properly, you'll need to add two things in an untracked file:**

1. Cognitive Services key, and
2. Cognitive Services endpoint.

Remember to treat these Azure credentials like passwords: keep them private and secure!

**There are two ways to run the notebook code:**

1. Go through each section and click the green arrow <span style="color:green">&#9655;</span> on the top, left-hand side of each code block; or
2. Run all cells by clicking the double-arrow icon at the very top of the notebook.
 

### Let's get started!
Read through the section below to learn more about Cognitive Services Text Analytics. Then run the code block below to import and process the sample text files stored in the *TextFiles* folder.

To process your own text, replace the sample text files in the *TextFiles* folder with your own .txt files.

## More information on Text Data
This notebook takes a set of text files stored in the same folder. The input files can be different lengths, but distinct sentences or paragraphs should be separated by a newline.

You may run this notebook on the sample text provided, which includes text copied from Wikipedia articles. Each text file contains content from a single Wikipedia article, with the file name indicating the name of the article.

Alternatively, you may upload your own text files into the **TextFiles** folder. How you organize the text into different files will depend on your particular needs. Here is an example scenario for handling open-form survey responses:

1. Copy and paste all responses to a single question into a unique text file. 
    * Copying and pasting from Excel rows works great, as it will include a newline between each unique response.
2. Save each text file in a subfolder called "TextFiles" (or update the filepath in the code block below where it says "To Do").
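
The survey scenario above can also be scripted. The following is a minimal sketch, assuming the responses are already available as a Python list; the `write_responses` helper and the file name `question-1.txt` are hypothetical and not part of this notebook.

```python
import tempfile
from pathlib import Path


def write_responses(responses, file_path):
    """Write one response per line so each stays a distinct unit of text."""
    lines = []
    for response in responses:
        # Collapse internal line breaks and extra spaces so a multi-line
        # response is not mistaken for several responses.
        cleaned = " ".join(response.split())
        if cleaned:
            lines.append(cleaned)
    Path(file_path).write_text("\n".join(lines), encoding="utf-8")
    return len(lines)


# Example run in a temporary folder; point this at your TextFiles folder instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "question-1.txt"
    count = write_responses(["Loved it!", "Too long.\nToo dense.", "   "], target)
    saved = target.read_text(encoding="utf-8")

print(count)
print(saved)
```

Blank responses are skipped, so the number returned is the number of lines actually written.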

**Troubleshooting**

If you run into any issues as you run the code blocks, here are some troubleshooting tips that might be helpful:
* The Cognitive Services Text Analytics Python SDK V3 requires the following input formats:
    * Raw Text Input
    * Encoding: UTF-8 or UTF-16
    * Document size less than 5,120 characters (this is handled in the import data section)
    * Some methods (e.g. entity recognition) require a batch size of 5 or fewer documents.
    * Input text should be in one of the following formats:
        * a list of strings: 
        
            ```list[str]```
        * a list of text documents: 
        
            ```list[TextDocumentInput]```
        * or a list of dictionary representations with at least two string elements, ID and text:

             ```list[dict[str, str]]``` 
             
             E.g. ```texts = [{'id': Unique_Id1, 'text': Input_Text1}, {'id': Unique_ID2, 'text': Input_Text2}, ...]```
    * If you wish to specify the ID and language on a per-item basis you must use as input a list of text documents or a list of dict representations. 
    * [More info on data limits here](https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/overview#data-limits)
* An "HttpOperationErrText", "HttpResponseError", or similar error indicates that the Text Analytics method did not receive correctly formatted input. Some things to check:
    * The batch size is within the limit for the method you are calling.
    * When iterating through multiple text files, each method input matches one of the three formats above.
    * Every dictionary in a list of dictionaries has a unique ID.
* A "list index out of range" error indicates that the input text is too long for the Text Analytics client to handle.
* Run into other issues? Please let us know by opening an issue on the [GitHub repo](https://github.com/jenfoxbot/text-analytics-walkthrough).
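
Several of the checks above can be automated before calling the service. The following is a minimal sketch, assuming the list-of-dictionaries input format and the limits quoted in the list above; `find_input_problems` and `make_batches` are hypothetical helper names, not SDK functions.

```python
MAX_CHARS = 5120   # document size limit quoted in the list above
MAX_BATCH = 5      # batch limit quoted above for some methods


def find_input_problems(documents):
    """Return a list of human-readable problems found in a list of dict inputs."""
    problems = []
    seen_ids = set()
    for position, doc in enumerate(documents):
        doc_id = doc.get("id")
        text = doc.get("text")
        if not isinstance(doc_id, str) or not isinstance(text, str):
            problems.append(f"item {position}: 'id' and 'text' must both be strings")
            continue
        if doc_id in seen_ids:
            problems.append(f"item {position}: duplicate id {doc_id!r}")
        seen_ids.add(doc_id)
        # Flag texts that are over the character limit.
        if len(text) > MAX_CHARS:
            problems.append(f"item {position}: text has {len(text)} characters")
    return problems


def make_batches(documents, size=MAX_BATCH):
    """Split the documents into consecutive lists of at most `size` items."""
    return [documents[i:i + size] for i in range(0, len(documents), size)]


sample = [
    {"id": "a.txt", "text": "short text"},
    {"id": "a.txt", "text": "same id as the first item"},
    {"id": "b.txt", "text": "x" * 6000},
]
for problem in find_input_problems(sample):
    print(problem)
print([len(batch) for batch in make_batches(list(range(12)))])
```

An empty list from `find_input_problems` does not guarantee the service will accept the input, but it rules out the most common formatting mistakes listed above.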

In [127]:
# This code imports the text files stored in the 'TextFiles' folder and splits any
# file longer than 5,100 characters into chunks that end at the last newline.
# Each chunk is stored as a dictionary, and the final ids are printed to the screen.

import os

# Note: your text files may have a different encoding than the one below
enc = 'utf-8'
# Maximum characters per chunk (kept under the 5,120-character service limit)
max_chars = 5100

# Import the text files in the TextFiles folder
# TO DO: Update file path as necessary
text_folder = os.path.join('TextFiles')

# Create a collection of texts with id (file name) and text (contents) properties
texts = []
for file_name in sorted(os.listdir(text_folder)):
    # Use a context manager so each file is closed after reading
    with open(os.path.join(text_folder, file_name), encoding=enc) as text_file:
        survey_text = text_file.read()

    if len(survey_text) <= max_chars:
        texts.append({"id": file_name, "text": survey_text})
        continue

    # Counter to track the number of times a split is done
    text_split_counter = 0
    while len(survey_text) > max_chars:
        # Find the last newline within the first max_chars characters
        index = survey_text.rfind("\n", 0, max_chars)
        if index <= 0:
            # No usable newline: fall back to a hard split at max_chars
            index = max_chars
        # Save the text up to the split point
        texts.append({"id": f"{file_name}.{text_split_counter}", "text": survey_text[:index]})
        text_split_counter += 1
        # Remove the saved text (and the newline at the split) from survey_text
        survey_text = survey_text[index:].lstrip("\n")
    # Save whatever remains after the last split
    if survey_text:
        texts.append({"id": f"{file_name}.{text_split_counter}", "text": survey_text})

# Print the id of each text
for text in texts:
    print(text['id'])
    # OPTIONAL: To see all text, uncomment the following line
    #print('{}\n{}\n'.format(text['id'], text['text']))

Wikipedia-A-History-of-the-Birds-of-Europe.txt.0
Wikipedia-A-History-of-the-Birds-of-Europe.txt.1
Wikipedia-Hypatia.txt.0
Wikipedia-Hypatia.txt.1
Wikipedia-Hypatia.txt.2
Wikipedia-Indigeneous-Peoples-of-the-Americas.txt.0
Wikipedia-Indigeneous-Peoples-of-the-Americas.txt.1
Wikipedia-Indigeneous-Peoples-of-the-Americas.txt.2


## Get the Key and Endpoint for your Cognitive Services resource
This notebook assumes you already have a Cognitive Services resource in your Azure subscription. (If not, follow the instructions in **Step 3** of the [README of this GitHub repo](https://github.com/microsoft/text-analytics-walkthrough).)

To use Azure Cognitive Services, you'll need a key, which is similar to a password, and an endpoint, which is the address used to reach the Cognitive Services resource you created. It's important to keep both of these private and secure! 

### A. Create a private file to store your Azure key and endpoint.
The code block below installs a Python library called *python-dotenv*, which allows you to read environment variables from a file. You can store your Azure key and endpoint in this (untracked) file, so if you fork this project your Azure credentials remain private.

Run the code below to install the library and create the *.env* file where you'll store your key and endpoint.

*Note: If you're on a Linux or Unix machine, use the appropriate command line calls.*
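
If you would rather avoid shell differences between operating systems, the same placeholder file can be created from Python. This is a minimal sketch; `create_env_template` is a hypothetical helper, and the variable names match the ones used in this notebook.

```python
import tempfile
from pathlib import Path


def create_env_template(folder):
    """Create a .env file with empty placeholders unless one already exists."""
    env_path = Path(folder) / ".env"
    if env_path.exists():
        # Never overwrite a file that may already hold real credentials.
        return False
    env_path.write_text("YOUR_COG_KEY=\nYOUR_COG_ENDPOINT=\n", encoding="utf-8")
    return True


# Demonstrated in a temporary folder; pass "." to create it next to the notebook.
with tempfile.TemporaryDirectory() as tmp_dir:
    first = create_env_template(tmp_dir)
    second = create_env_template(tmp_dir)
    content = (Path(tmp_dir) / ".env").read_text(encoding="utf-8")

print(first, second)
```

The second call returns `False` because the file already exists, which protects credentials you have already saved.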

In [128]:
# WINDOWS: Use the following commands
!pip install python-dotenv

!echo YOUR_COG_KEY= > .env
!echo YOUR_COG_ENDPOINT= >> .env





### B. Get your Azure key and endpoint and store them in the *.env* file.

1. In VS Code, open the *.env* file you just created. You should see two (blank) variables: *YOUR_COG_KEY* and *YOUR_COG_ENDPOINT*. 
2. In a browser window, open the [Azure portal](https://portal.azure.com).
3. Select your Cognitive Services resource. On the **Overview** page, click **"Keys and endpoint"** in the menu on the left-hand side (under Resource Management).
4. Copy **Key1** for your resource and paste it into the *.env* file for **YOUR_COG_KEY**, after the equals sign and **between quotes**, without any spaces, like so:
    ```YOUR_COG_KEY='COG_KEY_HERE'```
5. Copy the **endpoint** for your resource and paste it into the *.env* file for **YOUR_COG_ENDPOINT** in the same way:
    ```YOUR_COG_ENDPOINT='COG_ENDPOINT_HERE'```
6. Run the code in the cell below to load the variables from the *.env* file into this notebook environment.

In [129]:
# Import dotenv library functions and load in Azure credentials from .env file,
# then confirm that both variables were found (without printing the key itself).
import os
from dotenv import load_dotenv
load_dotenv('./.env')

cog_key = os.getenv('YOUR_COG_KEY')
cog_endpoint = os.getenv('YOUR_COG_ENDPOINT')

if not cog_key or not cog_endpoint:
    print('Missing key or endpoint: check the values in your .env file')
else:
    print('Ready to use cognitive services at {}'.format(cog_endpoint))
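
After the variables are loaded, a quick check can confirm that both values are present without showing the full key on screen. The following is a minimal sketch; `describe_settings` is a hypothetical helper, and the rule that the endpoint starts with `https://` is an assumption about how Azure endpoints are usually written.

```python
import os


def describe_settings(environ=os.environ):
    """Summarize the two settings without revealing the key itself."""
    key = environ.get("YOUR_COG_KEY", "")
    endpoint = environ.get("YOUR_COG_ENDPOINT", "")
    issues = []
    if not key:
        issues.append("YOUR_COG_KEY is missing or empty")
    # Assumption: a valid endpoint is an https URL.
    if not endpoint.startswith("https://"):
        issues.append("YOUR_COG_ENDPOINT should start with https://")
    # Show only the last four characters of the key.
    masked_key = "*" * max(len(key) - 4, 0) + key[-4:]
    return {"endpoint": endpoint, "key": masked_key, "issues": issues}


# Hypothetical values for illustration only.
example = {
    "YOUR_COG_KEY": "0123456789abcdef",
    "YOUR_COG_ENDPOINT": "https://example.invalid/",
}
print(describe_settings(example))
```

Calling `describe_settings()` with no arguments reads the real environment variables loaded from the *.env* file.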

## Install Azure Cognitive Services Text Analytics SDK
Run the code below to install the Text Analytics SDK in your local environment.

In [130]:
!pip install azure-ai-textanalytics --pre



## Import Cognitive Services libraries and create a Text Analytics client

Run the following code to import the Cognitive Services Text Analytics library. We'll also create a client for the Text Analytics Cognitive Services resource, which takes two inputs: the key and endpoint that you loaded above.

In [131]:
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

def authenticate_client():
    ta_credential = AzureKeyCredential(cog_key)
    text_analytics_client = TextAnalyticsClient(
            endpoint=cog_endpoint, 
            credential=ta_credential)
    return text_analytics_client

client = authenticate_client()

In [132]:
# Keep a second, more descriptive name for the same Text Analytics client.
# (The legacy azure-cognitiveservices-language-textanalytics package is not needed.)
text_analytics_client = client

## Extract Key Phrases
We're now ready to start using the text analytics service! First, let's get a list of key phrases from our set of texts. This helps give some indication of common themes and talking points without us having to read all of the text.

When you run the code block below, it outputs key phrases for each set of texts. 
It also prints the file name, basic statistics for each text file (e.g. character count), and the number of key phrases.

### More Info
* The key phrases service processes each input as a whole: key phrases are extracted based on the entire input text.
* The number of returned key phrases is proportional to the size of the input text.
* The SDK documentation for this service can be found [here](https://docs.microsoft.com/en-us/python/api/azure-ai-textanalytics/azure.ai.textanalytics.textanalyticsclient?view=azure-python).
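
Because long files are split into chunks with ids such as `file.txt.0` and `file.txt.1`, key phrases come back per chunk rather than per file. The following is a minimal sketch of one way to merge them again; `source_file` and `merge_key_phrases` are hypothetical helpers, and the sample phrases are made up for illustration.

```python
from collections import defaultdict


def source_file(document_id):
    """Strip the numeric chunk suffix, e.g. 'notes.txt.1' -> 'notes.txt'."""
    base, _, suffix = document_id.rpartition(".")
    return base if suffix.isdigit() and base else document_id


def merge_key_phrases(results):
    """Combine (document id, key phrases) pairs into one ordered list per file."""
    merged = defaultdict(list)
    for document_id, phrases in results:
        bucket = merged[source_file(document_id)]
        for phrase in phrases:
            if phrase not in bucket:  # keep first occurrence, drop repeats
                bucket.append(phrase)
    return dict(merged)


# Hypothetical results for illustration; real ones would come from the service.
sample_results = [
    ("Wikipedia-Hypatia.txt.0", ["Alexandria", "mathematics"]),
    ("Wikipedia-Hypatia.txt.1", ["astronomy", "Alexandria"]),
    ("notes.txt", ["single chunk"]),
]
print(merge_key_phrases(sample_results))
```

With real results, the pairs could be built from each successful document's `id` and `key_phrases`.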

In [133]:
# Load in text files and do key phrase extraction on each file
key_phrase_analysis = client.extract_key_phrases(documents=texts, show_stats=True)

# For each text file, get and print key phrases and document statistics
for doc in key_phrase_analysis:
    if doc.is_error:
        # Print the id and error for any document that could not be processed
        print(doc.id, doc.error)
        continue

    # Print the document title
    print(doc.id)
    # Print stats for each file
    print('\nStatistics: {}'.format(doc.statistics))

    # Get the key phrases in this document
    key_phrases = doc.key_phrases
    print('\nNumber of Key Phrases: {}'.format(len(key_phrases)))
    # Print each key phrase on a new line
    for key_phrase in key_phrases:
        print('\t', key_phrase)
    print('\n')

Wikipedia-A-History-of-the-Birds-of-Europe.txt.0

Statistics: {'character_count': 4171, 'transaction_count': 5}

Number of Key Phrases: 107
	 John Gould's five-volume Birds of Europe
	 then-known birds
	 bird species
	 Ornithology of Francis Willughby of Middleton
	 Dresser's extensive collection of birds
	 John Wilkins
	 Books of Ornithology
	 modern ornithology
	 early works
	 pioneering ornithological work of John Ray
	 anatomical features
	 Francis Bacon
	 classification of specimens
	 quarto parts
	 Early ornithologies
	 effective classification system
	 innovative features
	 dichotomous key
	 century
	 English-language ornithologies
	 cost
	 English Royal Society
	 History
	 readers
	 Dutch artist John Gerrard Keulemans
	 Richard Bowdler Sharpe
	 worldwide distribution
	 advancement of knowledge
	 earlier volumes
	 first-hand knowledge
	 modern zoology
	 volume ornithological book
	 Dresser's outdated views
	 Zoological Society of London
	 Henry Eeles Dresser
	 Mathurin Jacques B

## Extract Known Entities

Next, let's look for common entities in our text responses. *Entities* are things that reference some commonly understood type of item, 
for example, a location, a person, or an organization. The following code block pulls out "Organization", "Person", "Location", and "Other" entities. 

A full list of supported entities can be found [here](https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/named-entity-types?tabs=general).

### More Info
Some entities are sufficiently well-known to have an associated Wikipedia page. The Text Analytics service can also return the URL for that page using the `recognize_linked_entities` method. For more information, [check out the Python SDK method documentation here](https://docs.microsoft.com/en-us/python/api/azure-ai-textanalytics/azure.ai.textanalytics.textanalyticsclient?view=azure-python#recognize-entities-documents----kwargs-).
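
The same entity is often recognized several times in one text, each time with a different confidence score. The following is a minimal sketch of one way to keep a single entry per entity with its highest score; `strongest_mentions` is a hypothetical helper, and the tuples are sample values in the shape of the category, text, and confidence fields printed by this notebook.

```python
def strongest_mentions(entities):
    """Keep one entry per (category, text) pair with its highest confidence."""
    best = {}
    for category, text, confidence in entities:
        key = (category, text)
        if key not in best or confidence > best[key]:
            best[key] = confidence
    # Sort by confidence, highest first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Sample (category, text, confidence) values for illustration.
mentions = [
    ("Location", "Europe", 0.58),
    ("Person", "John Gould", 0.88),
    ("Location", "Europe", 0.4),
    ("Location", "Europe", 0.71),
    ("Person", "Dresser", 0.38),
    ("Person", "Dresser", 0.3),
]
for (category, text), confidence in strongest_mentions(mentions):
    print(f"{category}: {text} ({confidence})")
```

Keeping the maximum is only one possible choice; averaging the scores or counting mentions may suit other analyses better.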

In [135]:
# Use the client and texts you created previously to get named entities
# Set batch size to be 5 (max input for recognize_entities)
batch_size = 5

for i in range(0, len(texts), batch_size):
    # Create subset of dictionaries
    batch = texts[i:i+batch_size]
    # Run entity analysis on sets of 5 or fewer inputs
    entity_analysis = client.recognize_entities(batch)

    # Iterate through results and print entity analysis for each text
    for doc in entity_analysis:
        if doc.is_error:
            # Report documents the service could not process, then move on
            print("\n{}: {}".format(doc.id, doc.error))
            continue
        # Print text ID (taken from the result so it always matches the document)
        print("\n" + doc.id)
        for entity in doc.entities:
            # Only get Organization, Person, Location, and Other entities
            # TO DO: Add or remove entities to this list according to needs/interests
            if entity.category in ['Organization', 'Person', 'Location', 'Other']:
                print(' - {}: {} (Confidence: {})'.format(entity.category, entity.text, entity.confidence_score))


Wikipedia-A-History-of-the-Birds-of-Europe.txt.0
 - Person: Henry Eeles Dresser (Confidence: 0.91)
 - Person: Richard Bowdler Sharpe (Confidence: 0.88)
 - Location: Europe (Confidence: 0.58)
 - Person: Ray (Confidence: 0.57)
 - Person: Francis Willughby (Confidence: 0.79)
 - Person: John Gould (Confidence: 0.88)
 - Location: Europe (Confidence: 0.4)
 - Person: Sharpe (Confidence: 0.79)
 - Organization: Zoological Society of London (Confidence: 0.76)
 - Person: Gould (Confidence: 0.79)
 - Location: Europe (Confidence: 0.71)
 - Location: North Africa (Confidence: 0.73)
 - Location: Middle East (Confidence: 0.42)
 - Location: Atlantic archipelagos (Confidence: 0.49)
 - Location: Madeira (Confidence: 0.4)
 - Location: Canary Islands (Confidence: 0.46)
 - Location: Azores (Confidence: 0.5)
 - Person: Dresser (Confidence: 0.38)
 - Person: Dresser (Confidence: 0.3)
 - Person: Gerrard Keulemans (Confidence: 0.49)
 - Person: Sharpe (Confidence: 0.61)
 - Location: Europe (Confidence: 0.45)
 - P

## Going Further

Congratulations! You did machine learning on text to get some (hopefully) helpful insights!  

There are other services available to you, including sentiment analysis and language detection. For more information on those services, check out the Microsoft Docs [Python Cognitive Services Text Analytics SDK](https://docs.microsoft.com/en-us/python/api/azure-ai-textanalytics/azure.ai.textanalytics.textanalyticsclient?view=azure-python). Fun fact: you can also do [sentiment analysis in Excel](https://www.mrexcel.com/excel-tips/sentiment-analysis/)!
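
As a starting point for sentiment analysis, the sketch below follows the same pattern as the key phrase and entity cells. It assumes the SDK client's `analyze_sentiment` method and the `id`, `sentiment`, and `is_error` attributes on its results; `summarize_sentiment` is a hypothetical helper, and `FakeClient` is a stand-in so the example runs without an Azure resource.

```python
from types import SimpleNamespace


def summarize_sentiment(client, documents):
    """Return {document id: overall sentiment label} for documents without errors."""
    summary = {}
    for doc in client.analyze_sentiment(documents=documents):
        if doc.is_error:
            continue  # skipped here; inspect doc.error when debugging
        summary[doc.id] = doc.sentiment
    return summary


class FakeClient:
    """Stand-in used only so this sketch runs without an Azure resource."""

    def analyze_sentiment(self, documents):
        return [
            SimpleNamespace(id=doc["id"], is_error=False, sentiment="neutral")
            for doc in documents
        ]


print(summarize_sentiment(FakeClient(), [{"id": "demo.txt", "text": "Hello."}]))
```

To try it against the service, pass the `client` created earlier in this notebook and a batch of the `texts` dictionaries instead of the stand-in.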

You can also use Azure Cognitive Services for all sorts of other things like Computer Vision, Anomaly Detection, and Speech Recognition! [Here's a handy overview](https://docs.microsoft.com/en-us/python/api/overview/azure/cognitive-services?view=azure-python) that covers what else you can do in Python.

Questions? Requests? Let us know! Open a Pull Request on our repo or send us an e-mail: AskAMaker@microsoft.com

Thanks for reading!