# Doing Text Analysis with Azure Cognitive Services Text Analytics

## Overview
This notebook goes through the basics of writing a custom Cognitive Services Text Analytics application for analyzing long, open-form text. 
Specifically, we'll get key phrases and common entities (e.g. organizations) from each text input. 

The full SDK for Cognitive Services Text Analytics with Python can be found [here](https://azuresdkdocs.blob.core.windows.net/$web/python/azure-cognitiveservices-language-textanalytics/0.2.0/azure.cognitiveservices.language.textanalytics.html#azure.cognitiveservices.language.textanalytics.TextAnalyticsClient.key_phrases).

### How to use this notebook
This notebook contains all instructions and code needed to run text analytics on multiple text files. **For the code to function properly, you'll need to change two things:**

1. The Cognitive Services key, and
2. The Cognitive Services endpoint.

There are two ways to run the notebook code:

1. Go through each section and click the green arrow <span style="color:green">&#9655;</span> on the top left-hand side of each code block; or
2. Run all cells by clicking the double-arrow icon at the very top of the notebook.
 
### Let's get started!

## More information on Text Data
This notebook reads a set of text files stored in a single folder. The input files can be different lengths, but distinct sentences or paragraphs should be separated by a new line.

You may run this notebook on the sample text provided, which includes text copied from Wikipedia articles. Each text file contains content from a single Wikipedia article, with the file name indicating the name of the article.

Alternatively, you may upload your own text files into the **TextFiles** folder. How you organize the text into different files will depend on your particular needs. Here is an example scenario for handling open-form survey responses:

1. Copy and paste all responses to a single question into a unique text file. 
    * Copying and pasting from Excel rows works great, as it will include a newline between each unique response.
2. Save each text file in a subfolder called "TextFiles" (or update the filepath in the code block below where it says "To Do").
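
If the responses are already available in Python rather than in a spreadsheet, the same file layout can be produced in code. The sketch below is only an illustration: `write_responses` and the `responses_by_question` dictionary are hypothetical names, not part of this notebook or of the Text Analytics SDK.

```python
import os

def write_responses(responses_by_question, folder="TextFiles", encoding="utf-8"):
    """Write one text file per question, with one response per line.

    `responses_by_question` is a hypothetical dict that maps a short
    question name to a list of response strings.
    """
    os.makedirs(folder, exist_ok=True)
    written = []
    for question, responses in responses_by_question.items():
        path = os.path.join(folder, f"{question}.txt")
        # Collapse whitespace inside a response so each response stays on one line
        lines = [" ".join(response.split()) for response in responses]
        with open(path, "w", encoding=encoding) as text_file:
            text_file.write("\n".join(lines))
        written.append(path)
    return written
```

Calling `write_responses({"question1": ["First answer", "Second answer"]})` would create `TextFiles/question1.txt` with two lines, which matches the newline-separated format described above.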

### Loading the Data
Run the code below by clicking the <span style="color:green">&#9655;</span> button to import and process the text files. Read through the comments for more details.

**Troubleshooting**

If you run into any issues as you run the code blocks, here are some troubleshooting tips that might be helpful:
* The Cognitive Services Text Analytics client requires the following input format:
    * Document size less than 5,120 characters (this is handled in the import data section)
    * Input text should be in the following format: 
        ```text = {'id': Unique_Id, 'text': Input_Text}```
    * You may input a list of (properly formatted) dictionaries. Note that the Text Analytics client will output attributes (e.g. key phrases) for each input dictionary.
* An "HttpOperationError" indicates that a Text Analytics client method (such as the sentiment method) did not get a correctly formatted input. Some things to check:
    * The document input contains unique ids for each input.
    * Check the format of your input and make sure it matches the dictionary format above.
* A "list index out of range" error indicates that the input text is too long for the Text Analytics client to handle.
* Run into other issues? Please let us know by opening an issue on the [GitHub repo](https://github.com/jenfoxbot/text-analytics-walkthrough).
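
Most of the input problems listed above can be caught before calling the service. The following is a minimal sketch of such a check; `check_documents` is a hypothetical helper written for this walkthrough, not an SDK function, and it only tests the rules stated above (dictionary format, unique ids, fewer than 5,120 characters).

```python
MAX_CHARS = 5120  # document size limit stated in the troubleshooting list

def check_documents(documents):
    """Return a list of human-readable problems found in `documents`."""
    problems = []
    seen_ids = set()
    for position, document in enumerate(documents):
        # Each item must be a dict with both an 'id' and a 'text' key
        if not isinstance(document, dict) or not {"id", "text"} <= document.keys():
            problems.append(f"item {position}: expected a dict with 'id' and 'text' keys")
            continue
        # Ids must be unique across the whole input list
        if document["id"] in seen_ids:
            problems.append(f"item {position}: duplicate id {document['id']!r}")
        seen_ids.add(document["id"])
        # The text must stay under the character limit
        if len(document["text"]) >= MAX_CHARS:
            problems.append(f"item {position}: text has {len(document['text'])} characters")
    return problems
```

An empty list means the input passed these checks; it does not guarantee that the service will accept the request.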

In [None]:
import os
# Note: your text files may have a different encoding than the one below
enc = 'utf-8'

# Read the text files in the TextFiles folder
# TO DO: Update file path as necessary
text_folder = os.path.join('TextFiles')

# Create a collection of texts with id (file name) and text (contents) properties
# Stay a little below the service limit of 5,120 characters per document
max_chars = 5100
texts = []
for file_name in os.listdir(text_folder):
    with open(os.path.join(text_folder, file_name), encoding=enc) as text_file:
        survey_text = text_file.read()

    # Short files are stored as a single document
    if len(survey_text) <= max_chars:
        texts.append({"id": file_name, "text": survey_text})
        continue

    # Counter to track number of times split is done
    text_split_counter = 0
    while len(survey_text) > max_chars:
        # Find the last newline within the first max_chars characters
        index = survey_text.rfind("\n", 0, max_chars)
        if index <= 0:
            # No usable newline: fall back to a hard split at max_chars
            index = max_chars
        # Save the text up to the split point
        texts.append({"id": f"{file_name}.{text_split_counter}", "text": survey_text[:index]})
        # Update counter
        text_split_counter += 1
        # Remove the saved text and the newline(s) that followed it
        survey_text = survey_text[index:].lstrip("\n")
    # Save whatever text remains after the last split
    if survey_text.strip():
        texts.append({"id": f"{file_name}.{text_split_counter}", "text": survey_text})

for text in texts:
    # Print the id of each text
    print(text['id'])
    # OPTIONAL: To see all text, uncomment the following line
    #print('{}\n{}\n'.format(text['id'], text['text']))

## Get the Key and Endpoint for your Cognitive Services resource
This notebook assumes you already have a Cognitive Services resource in your Azure subscription. (If not, follow the instructions in **Step 3** of the [README of this GitHub repo](https://github.com/microsoft/text-analytics-walkthrough).)

1. In another browser tab, open the Azure portal at https://portal.azure.com 
2. Select your Cognitive Services resource. On the **Overview** page, click on **"Keys and endpoint"** in the menu on the left-hand side (under Resource Management).
3. Copy the **Key1** for your resource and paste it in the code below, replacing **YOUR_COG_KEY**.
4. Copy the **endpoint** for your resource and paste it in the code below, replacing **YOUR_COG_ENDPOINT**.
5. Run the code in the cell below by clicking its green <span style="color:green">&#9655;</span> button.


In [None]:
# TO DO: Fill in your cognitive services key and endpoint
cog_key = 'YOUR_COG_KEY'
cog_endpoint = 'YOUR_COG_ENDPOINT'

print('Ready to use cognitive services at {} using key {}'.format(cog_endpoint, cog_key))
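
Pasting the key directly into a notebook is simple, but the key then ends up in the saved file and in the printed output. One possible alternative, sketched below, is to read both values from environment variables and to mask the key before printing it. The variable names `COG_KEY` and `COG_ENDPOINT` and the helpers `load_cog_settings` and `mask` are assumptions made for this example; they are not defined by Azure or by this notebook.

```python
import os

def load_cog_settings(environ=os.environ):
    """Read the key and endpoint from environment variables.

    COG_KEY and COG_ENDPOINT are hypothetical names chosen for this sketch.
    """
    key = environ.get("COG_KEY", "YOUR_COG_KEY")
    endpoint = environ.get("COG_ENDPOINT", "YOUR_COG_ENDPOINT")
    # Fail early if either placeholder was never replaced
    if key.startswith("YOUR_") or endpoint.startswith("YOUR_"):
        raise ValueError("Set COG_KEY and COG_ENDPOINT before running the notebook.")
    return key, endpoint

def mask(secret, visible=4):
    """Hide all but the last few characters of a secret for safe printing."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]
```

With both variables set, `cog_key, cog_endpoint = load_cog_settings()` would stand in for the hard-coded values, and `mask(cog_key)` could be printed instead of the full key.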

## Install Azure Cognitive Services Text Analytics SDK
Run the code below to install the text analytics SDK onto the compute virtual machine.

In [None]:
! pip install azure-cognitiveservices-language-textanalytics

## Import Cognitive Services libraries and create a Text Analytics client

Run the following code to import the Cognitive Services Text Analytics library. We'll also create a client for the text analytics cognitive services resource, which takes in two inputs: your endpoint and key that we set above.

In [None]:

from azure.cognitiveservices.language.textanalytics import TextAnalyticsClient
from msrest.authentication import CognitiveServicesCredentials

# Get a client for your text analytics cognitive service resource
text_analytics_client = TextAnalyticsClient(endpoint=cog_endpoint,
                                            credentials=CognitiveServicesCredentials(cog_key))

## Extract Key Phrases
We're now ready to start using the text analytics service! First, let's get a list of key phrases from our set of texts. This helps give some indication of common themes and trends without us having to read all of the text.

This code block prints the key phrases for each text.
It also prints the text id (the file name), basic statistics for each text (e.g. character count), and the number of key phrases.

### More Info
* The key phrases service processes each input as a whole. This means that key phrases are extracted based on the entire input text.
* The number of returned key phrases is proportional to the size of the input text.
* The SDK documentation for this service can be found [here](https://azuresdkdocs.blob.core.windows.net/$web/python/azure-cognitiveservices-language-textanalytics/0.2.0/azure.cognitiveservices.language.textanalytics.html#azure.cognitiveservices.language.textanalytics.TextAnalyticsClient.key_phrases).
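
Because key phrases are returned separately for each input, spotting common themes across inputs takes one more step. The sketch below counts how many documents mention each phrase. It is only an illustration: `most_common_phrases` is a hypothetical helper, and `sample` is made-up data standing in for the phrases the service would return.

```python
from collections import Counter

def most_common_phrases(key_phrases_by_document, top_n=5):
    """Return the phrases that appear in the most documents."""
    counts = Counter()
    for phrases in key_phrases_by_document.values():
        # Count each phrase once per document, ignoring case
        counts.update({phrase.lower() for phrase in phrases})
    return counts.most_common(top_n)

# Made-up key phrases for three documents
sample = {
    "doc1": ["Machine learning", "Azure"],
    "doc2": ["azure", "Text analytics"],
    "doc3": ["Azure", "machine learning"],
}
print(most_common_phrases(sample, top_n=2))
# prints [('azure', 3), ('machine learning', 2)]
```

Counting each phrase once per document keeps one long text from dominating the totals; counting every occurrence instead would be a reasonable alternative.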

In [None]:
# Send the texts to the key phrases service
key_phrase_analysis = text_analytics_client.key_phrases(documents=texts, show_stats=True)

# Print the key phrases for each text the service returned results for
for document in key_phrase_analysis.documents:
    # Print the text id
    print(document.id)

    # Print stats for the text
    print('\nStatistics: ')
    print(document.statistics)

    # Get the key phrases in this text
    key_phrases = document.key_phrases
    print('\nNumber of Key Phrases:')
    print(len(key_phrases))

    # Print each key phrase
    print('\nKey Phrases:')
    for key_phrase in key_phrases:
        print('\t', key_phrase)
    print('\n')

## Extract Known Entities

Next, let's look for common entities in our texts. *Entities* are things that reference some commonly understood type of item, for example a location, a person, or an organization. The following code block pulls out "Organization" and "Other" entities.

A full list of supported entity types can be found in the [Azure Cognitive Services documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/named-entity-types?tabs=general).

### More Info
Some entities are sufficiently well-known to have an associated Wikipedia page. The Text Analytics service can also return the URL for that page. 
To enable this, uncomment the line below the code comment that says "Get Wikipedia link".
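
When the same entity appears many times, grouping the results by type can make the output easier to scan. The sketch below shows one way to do that. `group_entities` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `type`, `name`, and `wikipedia_url` attributes used in this notebook; they are not real service results.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_entities(entities, wanted_types=("Organization", "Other")):
    """Group entity names by type, adding the Wikipedia URL when there is one."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity.type not in wanted_types:
            continue
        link = f" ({entity.wikipedia_url})" if entity.wikipedia_url else ""
        label = f"{entity.name}{link}"
        # Skip labels that were already recorded for this type
        if label not in grouped[entity.type]:
            grouped[entity.type].append(label)
    return dict(grouped)

# Stand-in objects with made-up values
sample_entities = [
    SimpleNamespace(type="Organization", name="Microsoft",
                    wikipedia_url="https://en.wikipedia.org/wiki/Microsoft"),
    SimpleNamespace(type="Person", name="Ada Lovelace", wikipedia_url=None),
    SimpleNamespace(type="Other", name="Python", wikipedia_url=None),
    SimpleNamespace(type="Other", name="Python", wikipedia_url=None),
]
print(group_entities(sample_entities))
```

The "Person" entity is dropped because it is not in `wanted_types`, and the repeated "Python" entry is listed once.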

In [None]:
# Use the client and texts you created previously to get named entities
entity_analysis = text_analytics_client.entities(documents=texts)

# Print the results for each text the service returned results for
for document in entity_analysis.documents:
    print(document.id)

    # Get the named entities in this text
    for entity in document.entities:
        # Only get Organization and Other entities
        # Add or remove entity types in this list according to needs/interests
        if entity.type in ['Organization', 'Other']:
            link = ''
            # Get Wikipedia link
            #link = '(' + entity.wikipedia_url + ')' if entity.wikipedia_id is not None else ''
            print(' - {}: {} {}'.format(entity.type, entity.name, link))

## Going Further

Congratulations! You did machine learning on text to get some (hopefully) helpful insights! There are other services available to you, including sentiment analysis and language detection. For more information on those services, check out the Microsoft Docs [Python Cognitive Services Text Analytics SDK](https://docs.microsoft.com/en-us/python/api/azure-cognitiveservices-language-textanalytics/azure.cognitiveservices.language.textanalytics.text_analytics_client.textanalyticsclient?view=azure-python).

You can also use Azure Cognitive Services for all sorts of other things like Computer Vision, Anomaly Detection, and Speech Recognition! [Here's a handy overview](https://docs.microsoft.com/en-us/python/api/overview/azure/cognitive-services?view=azure-python) that covers what else you can do in Python.

Questions? Requests? Let us know! Open a Pull Request on our repo or send us an e-mail: AskAMaker@microsoft.com

Thanks for reading!