In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/character-encoding-examples/file_guide.csv
/kaggle/input/character-encoding-examples/olaf_Windows-1251.txt
/kaggle/input/character-encoding-examples/shisei_UTF-8.txt
/kaggle/input/character-encoding-examples/die_ISO-8859-1.txt
/kaggle/input/character-encoding-examples/harpers_ASCII.txt
/kaggle/input/character-encoding-examples/yan_BIG-5.txt
/kaggle/input/character-encoding-examples/portugal_ISO-8859-1.txt


In [2]:
# import libraries
import numpy as np
import pandas as pd
import chardet # for detecting character encodings

In [3]:
# Open file_guide.csv file as it contains information about the text files
info_csv = pd.read_csv('../input/character-encoding-examples/file_guide.csv')
info_csv

Unnamed: 0,File,Text,Author,Encoding,Language,Words
0,die_ISO-8859-1.txt,Die Fürstin,Kasimir Edschmid,ISO-8859-1,German,13314
1,harpers_ASCII.txt,"Harper's Round Table, October 8, 1895",Various,ASCII,English,29094
2,olaf_Windows-1251.txt,Olaf van Geldern,Pencho Slaveykov,Windows 1251,Bulgarian,2790
3,portugal_ISO-8859-1.txt,"Portugal enfermo por vicios, e abusos de ambos...",José Daniel Rodrigues da Costa,ISO-8859-1,Portuguese,14215
4,shisei_UTF-8.txt,Shisei,Junichiro Tanizaki,UTF-8,Japanese,4809
5,yan_BIG-5.txt,Yan shi jia xun,Yan Zhitui,BIG-5,Chinese,2538


Here, we can see that the six text files use five different encodings (two of them share ISO-8859-1). Let's read the first text file and see what happens.

In [4]:
# Open the text file in binary mode ('rb') and read the first 200 raw bytes
with open('../input/character-encoding-examples/die_ISO-8859-1.txt', 'rb') as text_data:
    text_file = text_data.read(200)
print(text_file)

b'The Project Gutenberg EBook of Die F\xfcrstin, by Kasimir Edschmid\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away o'


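Here, the output is a bytes object: because the file was read in binary mode without decoding, the character ü appears as the escape \xfc and line breaks appear as \r\n. To read the text properly, we need to decode it with the appropriate encoding, so we first have to find out which encoding the file uses.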
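As a small illustration of why the encoding matters, the sketch below decodes the same bytes in two ways. The byte string is a shortened, hand-typed copy of the title seen in the output above, not something read from the file.

```python
# A short byte string containing the raw byte 0xFC seen in the output above.
raw = b"Die F\xfcrstin"

# ISO-8859-1 maps every byte to exactly one character, so 0xFC becomes "ü".
as_latin1 = raw.decode("ISO-8859-1")
print(as_latin1)

# A lone 0xFC byte is not valid UTF-8; errors="replace" substitutes U+FFFD
# (the replacement character) instead of raising UnicodeDecodeError.
as_utf8 = raw.decode("utf-8", errors="replace")
print(as_utf8)
```

Decoding with the wrong encoding either raises an error or silently produces wrong characters, which is why the next step detects the encoding first.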

In [5]:
with open('../input/character-encoding-examples/die_ISO-8859-1.txt', 'rb') as text_data:
    # Use the chardet library to detect the encoding from the first 1000 bytes.
    encode_type = chardet.detect(text_data.read(1000))
print(encode_type)

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}


From the output above, chardet's best guess is ISO-8859-1 (with a confidence of 0.73), which matches the encoding listed in file_guide.csv. Therefore, we can use it to open the text file as text.
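Since the detection result is a guess with a confidence score, one possible safeguard is to accept it only above some threshold. The helper below is a hypothetical sketch: the name `pick_encoding`, the 0.5 threshold, and the UTF-8 fallback are all assumptions, and it works on a dictionary shaped like the one printed above rather than calling chardet itself.

```python
def pick_encoding(detection, min_confidence=0.5, fallback="utf-8"):
    """Return the detected encoding, or `fallback` if the guess looks weak.

    `detection` is a dict with 'encoding' and 'confidence' keys, like the
    one printed above. The threshold and fallback are illustrative choices.
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding


# The detection result copied from the output above.
detected = {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
chosen = pick_encoding(detected)
print(chosen)
```

With the default threshold the detected encoding is kept. A stricter threshold such as `min_confidence=0.9` would fall back to UTF-8 for this file, so the threshold has to be chosen with the data in mind.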

In [6]:
with open('../input/character-encoding-examples/die_ISO-8859-1.txt', encoding = 'ISO-8859-1') as text_data:
    # Read the first 2000 characters of the decoded text
    text_file = text_data.read(2000)
print(text_file)

The Project Gutenberg EBook of Die Fürstin, by Kasimir Edschmid

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: Die Fürstin

Author: Kasimir Edschmid

Release Date: May 15, 2010 [EBook #32385]

Language: German

Character set encoding: ISO-8859-1

*** START OF THIS PROJECT GUTENBERG EBOOK DIE FÜRSTIN ***




Produced by Jens Sadowski




Transcriber's Note:
Double quotation marks have been encoded as » and «.




KASIMIR EDSCHMID

DIE FÜRSTIN





1920

PAUL CASSIRER VERLAG · BERLIN



ALLE RECHTE VORBEHALTEN

COPYRIGHT 1920 BY PAUL CASSIRER · BERLIN






GESCHRIEBEN NEUNZEHNHUNDERTSECHZEHN




INHALT

DAS FRAUENSCHLOSS
JAEL
DIE ABENTEUERLICHE NACHT
BRIEF
TRAUM









DAS FRAUENSCHLOSS


DIE Drachenköpfe unserer Boote bogen um das gelbe Segel. Die Parade vollzog
sich in elega

Now that we used the correct encoding, we were able to open the text file without any errors in the characters.

Let's create a function that detects the encoding of any given text file and opens it with that encoding.

In [7]:
def encoding_tool(first_file_name):
    # Build the Kaggle input path by concatenating the dataset directory and the file name
    file_name = '../input/character-encoding-examples/' + first_file_name
    # Detect the encoding from the raw bytes of the whole file
    with open(file_name, 'rb') as text_data:
        detect_file = chardet.detect(text_data.read())
    encoding_type = detect_file.get('encoding')
    # Reopen the file as text with the detected encoding and read the first 1500 characters
    with open(file_name, encoding = encoding_type) as final_data:
        result = final_data.read(1500)
    # Return the text so that the caller can store and print it
    return result
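The function above trusts whatever encoding is detected. A more defensive variant could fall back to a default when decoding fails. The sketch below is only one way to do that: `decode_with_fallback` is a hypothetical helper that works on bytes already read from a file, and the UTF-8 fallback with replacement characters is an assumption, not part of the original notebook.

```python
def decode_with_fallback(raw, encoding, fallback="utf-8"):
    """Decode `raw` bytes with `encoding`, or with `fallback` if that fails.

    A None encoding raises TypeError, an unknown codec name raises
    LookupError, and undecodable bytes raise UnicodeDecodeError.
    """
    try:
        return raw.decode(encoding)
    except (TypeError, LookupError, UnicodeDecodeError):
        # errors="replace" keeps going and marks bad bytes with U+FFFD.
        return raw.decode(fallback, errors="replace")


# The detected encoding works, so it is used directly.
print(decode_with_fallback(b"Die F\xfcrstin", "ISO-8859-1"))

# No encoding was detected (None), so the fallback is used instead.
print(decode_with_fallback(b"plain ASCII text", None))
```

The fallback trades correctness for robustness: the text is always readable, but bytes that cannot be decoded are replaced rather than recovered.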

In [8]:
# Let's open the first text file again, this time using the function
initial_text = 'die_ISO-8859-1.txt'
german_text = encoding_tool(initial_text)
print(german_text)

The Project Gutenberg EBook of Die Fürstin, by Kasimir Edschmid

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: Die Fürstin

Author: Kasimir Edschmid

Release Date: May 15, 2010 [EBook #32385]

Language: German

Character set encoding: ISO-8859-1

*** START OF THIS PROJECT GUTENBERG EBOOK DIE FÜRSTIN ***




Produced by Jens Sadowski




Transcriber's Note:
Double quotation marks have been encoded as » and «.




KASIMIR EDSCHMID

DIE FÜRSTIN





1920

PAUL CASSIRER VERLAG · BERLIN



ALLE RECHTE VORBEHALTEN

COPYRIGHT 1920 BY PAUL CASSIRER · BERLIN






GESCHRIEBEN NEUNZEHNHUNDERTSECHZEHN




INHALT

DAS FRAUENSCHLOSS
JAEL
DIE ABENTEUERLICHE NACHT
BRIEF
TRAUM









DAS FRAUENSCHLOSS


DIE Drachenköpfe unserer Boote bogen um das gelbe Segel. Die Parade vollzog
sich in elega

In [9]:
initial_text = 'harpers_ASCII.txt'
eng_text = encoding_tool(initial_text)
print(eng_text)

Project Gutenberg's Harper's Round Table, October 8, 1895, by Various

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Harper's Round Table, October 8, 1895

Author: Various

Release Date: July 14, 2010 [EBook #33158]

Language: English

Character set encoding: ASCII

*** START OF THIS PROJECT GUTENBERG EBOOK HARPER'S ROUND TABLE ***




Produced by Annie McGuire








[Illustration: HARPER'S ROUND TABLE]

Copyright, 1895, by HARPER & BROTHERS. All Rights Reserved.

       *       *       *       *       *

PUBLISHED WEEKLY. NEW YORK, TUESDAY, OCTOBER 8, 1895. FIVE CENTS A COPY.

VOL. XVI.--NO. 832. TWO DOLLARS A YEAR.

       *       *       *       *       *




[Illustration]

THE COPPERTOWN "STAR" ROUTE.

BY W. G. VAN TASSEL SUTPHEN.


The Happy Thought, as will be remember

In [10]:
initial_text = 'olaf_Windows-1251.txt'
bul_text = encoding_tool(initial_text)
print(bul_text)

The Project Gutenberg EBook of Olaf van Geldern, by Pencho Slaveykov
(#1 in our series by Pencho Slaveykov, note that #2 is our etext #3433,
with a September 2002 release date.)

Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.

This header should be the first thing seen when viewing this Project
Gutenberg file.  Please do not remove it.  Do not change or edit the
header without written permission.

Please read the "legal small print," and other information about the
eBook and Project Gutenberg at the bottom of this file.  Included is
important information about your specific rights and restrictions in
how the file may be used.  You can also find out about how to make a
donation to Project Gutenberg, and how to get involved.


**Welcome To The World of Free Plain Vanilla Electronic Texts**

**eBooks Readable By Both Humans and By Computers, Since 1971**

*

In [11]:
initial_text = 'portugal_ISO-8859-1.txt'
port_text = encoding_tool(initial_text)
print(port_text)

The Project Gutenberg EBook of Portugal enfermo por vicios, e abusos de
ambos os sexos, by José Daniel Rodrigues da Costa

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: Portugal enfermo por vicios, e abusos de ambos os sexos

Author: José Daniel Rodrigues da Costa

Release Date: March 23, 2010 [EBook #31743]

Language: Portuguese

Character set encoding: ISO-8859-1

*** START OF THIS PROJECT GUTENBERG EBOOK PORTUGAL ENFERMO POR VICIOS ***




Produced by Pedro Saborano (produced from scanned images
of public domain material from Google Book Search)






                             PORTUGAL ENFERMO

                  POR VICIOS, E ABUSOS DE AMBOS OS SEXOS.

                            DEDICADO AO SENHOR

                            JOSÉ LUIZ GUERNER,

                        C

In [12]:
initial_text = 'shisei_UTF-8.txt'
jap_text = encoding_tool(initial_text)
print(jap_text)

The Project Gutenberg EBook of Shisei, by Junichiro Tanizaki

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net


Title: Shisei

Author: Junichiro Tanizaki

Release Date: March 13, 2010 [EBook #31617]

Language: Japanese

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK SHISEI ***




Produced by Kaoru Tanaka




Title: 刺靑 (Shisei)
Author: 谷崎潤一郞 (Junichiro Tanizaki)
Language: Japanese
Character set encoding: UTF-16
Text preparation by Kaoru Tanaka

-------------------------------------------------------
Notes on the signs in the text

《...》 shows ruby (short runs of text alongside the base text to indicate pronunciation).
Eg. 其《そ》

｜ marks the start of a string of ruby-attached characters.
Eg. 十三｜年目《ねんめ》

［＃...］ explains the formatting of the original text.
Eg. ［＃ここか

In [13]:
initial_text = 'yan_BIG-5.txt'
chi_text = encoding_tool(initial_text)
print(chi_text)

The Project Gutenberg EBook of Yan shi jia xun, by Yan Zhitui
#4 in our series by Yan Zhitui

Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.

This header should be the first thing seen when viewing this Project
Gutenberg file.  Please do not remove it.  Do not change or edit the
header without written permission.

Please read the "legal small print," and other information about the
eBook and Project Gutenberg at the bottom of this file.  Included is
important information about your specific rights and restrictions in
how the file may be used.  You can also find out about how to make a
donation to Project Gutenberg, and how to get involved.


**Welcome To The World of Free Plain Vanilla Electronic Texts**

**eBooks Readable By Both Humans and By Computers, Since 1971**

*****These eBooks Were Prepared By Thousands of Volunteers!*****


Title: Yan shi jia 