# Russian lemmatizer

This Jupyter notebook takes a Russian text file, or a folder of Russian text files, and creates a set of lemmatized derivative files, in which every word appears in its dictionary form rather than an inflected form. These lemmatized files can then be used for searching or for other computational text analysis methods.

## Mounting Google Drive
Run the code cell below. You'll get a link that will take you to a screen where you can choose which Google account to authenticate with. After you choose an account and approve the connection, you'll get a long string of numbers and letters. Copy and paste it into the box that will appear in the code cell below, and hit enter.

If you see `Mounted at /content/gdrive`, then you know it worked.

In [None]:
#@title Mount Google Drive { vertical-output: true }
from google.colab import drive

drive.mount('/content/gdrive')
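
Once the cell above has finished, a quick check along the following lines can confirm that the Drive folder is visible to the notebook. This is only a sketch: `drive_is_mounted` is a hypothetical helper name, and the default path assumes the mount point and folder name used throughout this notebook.

```python
import os

def drive_is_mounted(drive_root="/content/gdrive/My Drive"):
    """Return True if the given folder exists (hypothetical helper)."""
    return os.path.isdir(drive_root)

if drive_is_mounted():
    print("Drive is mounted and ready to use.")
else:
    print("Drive folder not found. Try running the mount cell again.")
```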

## Setup
The code cell below installs the *pymystem3* package, which does the actual lemmatizing. It also downloads and installs some files that the package needs in order to work.

After the cell finishes running, you should see the word *mystem* in the output if everything worked correctly.

In [None]:
#@title Install pymystem3 and related files
#Imports the sys module which you can use to perform system-level things 
#like installing new modules
import sys

#Installs pymystem3
!{sys.executable} -m pip install pymystem3

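#Downloads the mystem program that pymystem3 relies on
!wget http://download.cdn.yandex.net/mystem/mystem-3.0-linux3.1-64bit.tar.gz
#Unpacks the downloaded archive, which contains the mystem program
!tar -xvf mystem-3.0-linux3.1-64bit.tar.gz
#Copies the mystem program to /bin
!cp mystem /bin/mystem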

#os is used for navigating your filesystem and getting to the files you want to lemmatize
import os

#pymystem3 is the module that does the actual lemmatizing
from pymystem3 import Mystem

#loads mystem
mystem = Mystem()
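
If the setup cell seems to have failed, one way to investigate is to check whether a program called `mystem` can be found on the system path. The sketch below uses only the Python standard library; `find_program` is a hypothetical helper name, and the check assumes the program was copied to a folder on the path, as the cell above does with `/bin`.

```python
import shutil

def find_program(name):
    """Return the full path of a program on the system path, or None if it is missing."""
    return shutil.which(name)

location = find_program("mystem")
if location is None:
    print("mystem was not found. Try re-running the install cell.")
else:
    print("mystem found at", location)
```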

## If you only want to lemmatize one file, put in the full path to your text file
**Skip down to the next section if you want to lemmatize a whole folder**

After uploading the text file to Drive, put in the full path to where it's located. All paths should begin with `/content/gdrive/My Drive/`.

If your file is located in the root of your Drive (not in a subfolder), the path should look like `/content/gdrive/My Drive/your-text-file.txt`

If it's in a subfolder, it should look like `/content/gdrive/My Drive/subfolder/your-text-file.txt` (you can include multiple subfolders if the file is nested more deeply).

If the lemmatization works correctly, you should see the message *File lemmatized! Check your Google Drive folder.*
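
The path rules above can be sanity-checked before lemmatizing. The following is a minimal sketch, not part of the notebook itself: `looks_like_drive_text_file` and `lemmatized_name` are hypothetical helper names, and the prefix is the one this notebook uses. It also shows one possible way to derive the output file name with `os.path.splitext`, which changes only the extension at the very end of the path.

```python
import os

# Prefix that every path in this notebook is expected to start with
DRIVE_PREFIX = "/content/gdrive/My Drive/"

def looks_like_drive_text_file(path):
    """Check the two rules described above: Drive prefix and .txt extension."""
    return path.startswith(DRIVE_PREFIX) and path.lower().endswith(".txt")

def lemmatized_name(path):
    """Build the output name by inserting -lemmatized before the final extension."""
    base, extension = os.path.splitext(path)
    return base + "-lemmatized" + extension

example = "/content/gdrive/My Drive/subfolder/your-text-file.txt"
print(looks_like_drive_text_file(example))
print(lemmatized_name(example))
```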

In [None]:
#@title Enter the full path to a single text (.txt) file
filepath = "/content/gdrive/My Drive/myfolder/mytextfile.txt" #@param {type:"string"}

#The outname is the name of the lemmatized file that this notebook creates
#If you want it to be named something other than the original file name + -lemmatized
#you can change that here
#Only the extension at the very end of the path is changed
base, extension = os.path.splitext(filepath)
outname = base + '-lemmatized' + extension
#Opens the file you specified and reads its text
with open(filepath, 'r', encoding='utf8') as f:
    text = f.read()
#Lemmatizes the text; the result is a Python list of lemmas
#The list also contains the spaces and punctuation between the words
tokens = mystem.lemmatize(text)
#Changes the lemmas from a Python list to a text string
#Nothing is added between items because the spaces are already in the list
lemmatized = "".join(tokens)
#Creates a text file with -lemmatized appended to the name and writes the lemmas to it
with open(outname, 'w', encoding='utf8') as out:
    out.write(lemmatized)
print('File lemmatized! Check your Google Drive folder.')
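
`mystem.lemmatize` returns the whitespace between words as separate items in its list, which is why the cell above joins the list with an empty string. The sketch below illustrates the effect with a hand-written sample list that only imitates the shape of that output (it does not call mystem, and the sample values are an assumption chosen for illustration). It also shows one possible way to keep just the lemmas by dropping whitespace-only items.

```python
# Hand-written sample imitating the shape of mystem.lemmatize output:
# lemmas, the whitespace between them, and a trailing newline
sample_tokens = ["красивый", " ", "мама", " ", "красиво", " ", "мыть", " ", "рама", "\n"]

# Joining with an empty string keeps the spacing exactly as it is in the list
joined_plain = "".join(sample_tokens)

# Joining with a space would widen every gap between words
joined_spaced = " ".join(sample_tokens)

# Keeps only the items that are not pure whitespace
lemmas_only = [token for token in sample_tokens if token.strip()]

print(repr(joined_plain))
print(repr(joined_spaced))
print(lemmas_only)
```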

## If you want to lemmatize an entire folder of files, put the path to the folder here
*Please include only paths to folders that contain text (.txt) files. If you have a folder that contains sub-folders of text files, run this cell once for each sub-folder, because files inside sub-folders are not processed.*

Paths should look like: '/content/gdrive/My Drive/subfolder' (you can also include multiple subfolders if it's nested)

In [None]:
#@title Enter the full path to a folder full of text (.txt) files
textfolder = "/content/gdrive/My Drive/myfolder" #@param {type:"string"}
#Changes the working directory to the folder you specified
os.chdir(textfolder)
#For every file in the folder you specified...
for filename in os.listdir(textfolder):
    #If it's a text file, but not one of the text files with just lemmas
    if filename.endswith('.txt') and not filename.endswith('-lemmatized.txt'):
        #The outname is the name of the lemmatized file that this notebook creates
        #If you want it to be named something other than the original file name + -lemmatized
        #you can change that here
        outname = filename.replace('.txt', '-lemmatized.txt')
        #Opens the file you specified
        with open(filename, 'r', encoding='utf8') as f:
            #Creates an empty text file with -lemmatized.txt appended to the name
            with open(outname, 'w', encoding='utf8') as out:
                #Reads the text of the file you specified
                text = f.read()
                #Lemmatizes the text
                tokens = mystem.lemmatize(text)
                #mystem.lemmatize already returns a Python list of lemmas,
                #including the spaces and punctuation between the words
                #Changes the lemmas from a Python list to a text string
                #Nothing is added between items because the spaces are already in the list
                lemmatized = "".join(tokens)
                #Writes the lemmas to the text file with -lemmatized appended to the name
                out.write(lemmatized)
print('Your folder now has lemmatized files!')
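
The cell above looks only at files directly inside the folder you specify. If your texts are spread across nested sub-folders, one possible approach is to collect the file paths with `os.walk` first. The following is a sketch under that assumption, not part of the original notebook: `find_text_files` is a hypothetical helper name, and the demonstration uses a temporary folder rather than Google Drive so that it can run anywhere.

```python
import os
import tempfile

def find_text_files(folder):
    """Return the paths of .txt files under folder, including sub-folders,
    skipping files that are already lemmatized output."""
    found = []
    for current_folder, _subfolders, filenames in os.walk(folder):
        for filename in filenames:
            if filename.endswith(".txt") and not filename.endswith("-lemmatized.txt"):
                found.append(os.path.join(current_folder, filename))
    return sorted(found)

# Small demonstration using a temporary folder instead of Google Drive
with tempfile.TemporaryDirectory() as demo_folder:
    os.mkdir(os.path.join(demo_folder, "nested"))
    demo_files = ["a.txt", "a-lemmatized.txt", "notes.md", os.path.join("nested", "b.txt")]
    for relative_path in demo_files:
        with open(os.path.join(demo_folder, relative_path), "w", encoding="utf8") as f:
            f.write("пример")
    # Shows the paths relative to the demonstration folder
    demo_names = [os.path.relpath(path, demo_folder) for path in find_text_files(demo_folder)]
    print(demo_names)
```

Each path returned by `find_text_files` could then be lemmatized in the same way as a single file.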

## About

This Jupyter notebook was developed by Quinn Dombrowski as part of the [Multilingual DH Russian Starter Kit](https://github.com/multilingual-dh/russian-starter-kit).