# Word count problem

***Problem Statement:*** Count the number of words of each length in the file 'wordcountproblem.txt'.

For example: the number of words of length 1, the number of words of length 2, and so on.

***Module used:***

    - re : Regular expression library
    
***Variables:***

    - text : contents of the file 'wordcountproblem.txt' -> type str
    
    - word_re : compiled regular expression that matches a word followed by a delimiter -> type re.Pattern
    
    - w : list of tuples (word, delimiter), one per match of 'word_re' -> type list
    
    - word_list : list of words present in the text -> type list
    
    - ans_dict : word lengths as keys, number of words of that length as values -> type dict
    
    - list_dict : word lengths as keys, list of lowercased words of that length as values (duplicates present) -> type dict
    
    - set_list_dict : 'list_dict' with duplicates removed, so each value is a set -> type dict
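
To make the shapes of these variables concrete, here is a minimal sketch that runs the same steps on a short string. The toy `text` is a hypothetical stand-in for the file contents, and the dictionaries are built with `dict.get` and `dict.setdefault` for brevity, which is one possible way to write the loops.

```python
import re

# Toy input (hypothetical), standing in for the contents of the file.
text = "Big Data and Hadoop, Big data. "

word_re = re.compile(r'([a-zA-Z]+)([\s.,/!])')
w = word_re.findall(text)          # list of (word, delimiter) tuples
word_list = [m[0] for m in w]      # keep only the words

ans_dict = {}    # length -> number of words of that length
list_dict = {}   # length -> lowercased words of that length (duplicates kept)
for word in word_list:
    ans_dict[len(word)] = ans_dict.get(len(word), 0) + 1
    list_dict.setdefault(len(word), []).append(word.lower())

# length -> set of distinct lowercased words of that length
set_list_dict = {k: set(v) for k, v in list_dict.items()}

print(w)
print(ans_dict)
print(set_list_dict)
```

Here `ans_dict` ends up as `{3: 3, 4: 2, 6: 1}`: 'Big' is counted twice, and 'Data' and 'data' both count as words of length 4.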
    

In [1]:
import re

In [2]:
with open('wordcountproblem.txt') as f:  # read the whole file into one string
    text = f.read()

In [3]:
text

'Big Data and Hadoop Design of HDFS, HDFS Concepts, Command Line Interface, Hadoop File Systems, Java Interface, Data Flow (Anatomy of a File Read, Anatomy of a File Write, Coherency Model), Parallel Copying with DISTCP, Hadoop Archives Cluster Specification, Cluster Setup and Installation, SSH Configuration, Hadoop Configuration (Configuration Management, Environment Settings, Important Hadoop Daemon Properties, Hadoop Daemon Addresses and Ports, Other Hadoop Properties, User Account Creation), Security, Benchmarking a Hadoop ClusterCluster Specification, Cluster Setup and Installation, SSH Configuration, Hadoop Configuration (Configuration Management, Environment Settings, Important Hadoop Daemon Properties, Hadoop Daemon Addresses and Ports, Other Hadoop Properties, User Account Creation), Security, Benchmarking a Hadoop Cluster Hadoop Data Types, Functional Programming Roots, Imperative vs Functional Programming, Concurrency and lock free data structure, Functional - Concept of Map

In [4]:
# a word is a run of letters followed by whitespace or one of . , / !
# (a word followed by any other character, e.g. ')', is not matched)
word_re = re.compile(r'([a-zA-Z]+)([\s.,/!])')
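
Because this pattern requires a delimiter from `[\s.,/!]` after each word, a word followed by something else (such as a closing parenthesis) or sitting at the very end of the text is skipped. In the text shown above, for instance, 'Model' in 'Coherency Model)' is absent from the list of matches. The sketch below illustrates this on a short sample taken from that text. The second pattern, `alt_re`, is an assumption for comparison only, not the pattern used in this notebook. Using it would change the counts reported further down.

```python
import re

sample = "Data Flow (Anatomy of a File Read, Coherency Model), Parallel"

# Pattern from the notebook: a word must be followed by whitespace or . , / !
word_re = re.compile(r'([a-zA-Z]+)([\s.,/!])')
found = [m[0] for m in word_re.findall(sample)]

# Alternative (assumption): match every run of letters, whatever follows it.
alt_re = re.compile(r'[a-zA-Z]+')
found_alt = alt_re.findall(sample)

print(found)      # 'Model' and 'Parallel' are missing
print(found_alt)  # every run of letters is reported
```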

In [5]:
w = word_re.findall(text)  # list of (word, delimiter) tuples
w

[('Big', ' '),
 ('Data', ' '),
 ('and', ' '),
 ('Hadoop', ' '),
 ('Design', ' '),
 ('of', ' '),
 ('HDFS', ','),
 ('HDFS', ' '),
 ('Concepts', ','),
 ('Command', ' '),
 ('Line', ' '),
 ('Interface', ','),
 ('Hadoop', ' '),
 ('File', ' '),
 ('Systems', ','),
 ('Java', ' '),
 ('Interface', ','),
 ('Data', ' '),
 ('Flow', ' '),
 ('Anatomy', ' '),
 ('of', ' '),
 ('a', ' '),
 ('File', ' '),
 ('Read', ','),
 ('Anatomy', ' '),
 ('of', ' '),
 ('a', ' '),
 ('File', ' '),
 ('Write', ','),
 ('Coherency', ' '),
 ('Parallel', ' '),
 ('Copying', ' '),
 ('with', ' '),
 ('DISTCP', ','),
 ('Hadoop', ' '),
 ('Archives', ' '),
 ('Cluster', ' '),
 ('Specification', ','),
 ('Cluster', ' '),
 ('Setup', ' '),
 ('and', ' '),
 ('Installation', ','),
 ('SSH', ' '),
 ('Configuration', ','),
 ('Hadoop', ' '),
 ('Configuration', ' '),
 ('Configuration', ' '),
 ('Management', ','),
 ('Environment', ' '),
 ('Settings', ','),
 ('Important', ' '),
 ('Hadoop', ' '),
 ('Daemon', ' '),
 ('Properties', ','),
 ('Hadoop', ' 

In [6]:
word_list = [match[0] for match in w]  # keep only the word from each (word, delimiter) tuple

In [7]:
ans_dict = {}
for word in word_list:
    # add one to the count for this word length (starting from 0 if unseen)
    ans_dict[len(word)] = ans_dict.get(len(word), 0) + 1
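
The same tally can also be produced with `collections.Counter` from the standard library. This is a sketch of an alternative, not what the notebook runs. The short `word_list` is a hypothetical stand-in for the list built in the previous cell.

```python
from collections import Counter

# Hypothetical stand-in for the notebook's word_list.
word_list = ['Big', 'Data', 'and', 'Hadoop', 'Design', 'of', 'HDFS']

# Count how many words there are of each length.
length_counts = Counter(len(word) for word in word_list)

# Print the counts in order of increasing length.
for length, count in sorted(length_counts.items()):
    print(length, count)
```

A `Counter` returns 0 for a length that never occurred, so looking up a missing length does not raise `KeyError`.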

In [8]:
ans_dict  # keys: word lengths, values: number of occurrences

{3: 1094,
 4: 1090,
 6: 944,
 2: 1003,
 8: 515,
 7: 698,
 9: 457,
 1: 225,
 5: 595,
 13: 8,
 12: 103,
 10: 304,
 11: 167,
 14: 13,
 15: 6}

In [9]:
# one empty list per length from 0 up to the longest word found (duplicates will be kept)
list_dict = {i: [] for i in range(max(ans_dict) + 1)}

In [10]:
for word in word_list:
    list_dict[len(word)].append(word.lower())  # lowercase so 'Data' and 'data' group together

In [11]:
set_list_dict = {k: set(v) for k, v in list_dict.items()}  # drop duplicate words within each length
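
If only the distinct words per length are needed, the grouping and de-duplication can be done in one pass with `collections.defaultdict(set)`, without creating a key for every length in advance. This is a sketch of an alternative. The names `unique_by_length` and the short `word_list` are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for the notebook's word_list.
word_list = ['Big', 'big', 'Data', 'and', 'HDFS', 'data']

# length -> set of distinct lowercased words of that length
unique_by_length = defaultdict(set)
for word in word_list:
    unique_by_length[len(word)].add(word.lower())

# Print the groups in order of increasing length.
for length in sorted(unique_by_length):
    print(length, sorted(unique_by_length[length]))
```

Unlike a dictionary pre-filled with empty lists, this one holds only the lengths that actually occur, so there is no empty entry for length 0.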

In [12]:
set_list_dict

{0: set(),
 1: {'a', 'j', 's', 't'},
 2: {'an',
  'as',
  'by',
  'if',
  'in',
  'is',
  'it',
  'of',
  'on',
  'or',
  'so',
  'to',
  'vs'},
 3: {'ago',
  'all',
  'and',
  'any',
  'are',
  'big',
  'can',
  'few',
  'for',
  'has',
  'how',
  'map',
  'not',
  'see',
  'ssh',
  'the',
  'toy',
  'use',
  'was',
  'you'},
 4: {'been',
  'code',
  'data',
  'doug',
  'even',
  'file',
  'flow',
  'free',
  'from',
  'gets',
  'good',
  'hand',
  'hdfs',
  'hive',
  'hold',
  'into',
  'java',
  'join',
  'just',
  'less',
  'lies',
  'line',
  'lock',
  'news',
  'next',
  'open',
  'read',
  'sets',
  'site',
  'soon',
  'stop',
  'term',
  'text',
  'than',
  'that',
  'them',
  'this',
  'used',
  'user',
  'were',
  'what',
  'with'},
 5: {'after',
  'based',
  'being',
  'error',
  'every',
  'fault',
  'given',
  'input',
  'known',
  'large',
  'level',
  'named',
  'other',
  'paper',
  'ports',
  'roots',
  'setup',
  'share',
  'since',
  'store',
  'table',
  'terms',
  

In [13]:
for k, v in list_dict.items():  # print the required answer: word length, number of words
    print(k, len(v))

0 0
1 225
2 1003
3 1094
4 1090
5 595
6 944
7 698
8 515
9 457
10 304
11 167
12 103
13 8
14 13
15 6
