# Software Engineering for Data Scientists

* **Modularity**
    * Readability, maintainability, solve each problem only once
    * Object, Class, Package
* **Documentation**
    * Prevent confusion and frustration
* **Testing**
    * Find & fix more bugs, run tests anytime/anywhere

## 1. PEP 8 coding style

In [6]:
%%capture
# PEP 8 coding style
!pip install pycodestyle

# Command-line usage: pycodestyle xx.py

In [20]:
%%capture
# Import needed package
import pycodestyle

# Create a StyleGuide instance
style_checker = pycodestyle.StyleGuide()

# Run PEP 8 check on multiple files
result = style_checker.check_files(['DataTypes.ipynb',
                                    'Efficient_Python.ipynb'])

# Print result of PEP 8 style check
print(result.messages)

13:1: W391 blank line at end of file


In [10]:
%%capture
pip install flake8 pycodestyle_magic

In [11]:
%load_ext pycodestyle_magic

In [12]:
%pycodestyle_on
# start to check jupyter notebook's style

## 2. Writing First Package

<div>
<img src="py_package.png" width="600"/>
</div>

### 2.1 Importing functions
If you wrote two functions for your package in the file `counter_utils.py`, named `plot_counter` and `sum_counters`, the following line imports them in `__init__.py` using relative syntax: <br/>

`from .counter_utils import plot_counter, sum_counters`
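
To see the relative import at work, here is a minimal sketch that builds a throwaway package in a temporary directory and imports it. The package name `demo_counter_pkg` and the function bodies are hypothetical placeholders (in particular, `plot_counter` does no real plotting); only the `__init__.py` line comes from the text above.

```python
import importlib
import sys
import tempfile
from collections import Counter
from pathlib import Path

# Build a throwaway package on disk (the package name is a hypothetical placeholder)
workdir = Path(tempfile.mkdtemp())
package_dir = workdir / "demo_counter_pkg"
package_dir.mkdir()

# counter_utils.py holds the actual functions (bodies are illustrative stand-ins)
(package_dir / "counter_utils.py").write_text(
    "from collections import Counter\n"
    "\n"
    "def sum_counters(counters):\n"
    "    return sum(counters, Counter())\n"
    "\n"
    "def plot_counter(counter):\n"
    "    return sorted(counter.items())  # stand-in, no real plotting\n"
)

# __init__.py re-exports them with the relative import
(package_dir / "__init__.py").write_text(
    "from .counter_utils import plot_counter, sum_counters\n"
)

# Make the temporary directory importable, then import the package
sys.path.insert(0, str(workdir))
importlib.invalidate_caches()
demo_counter_pkg = importlib.import_module("demo_counter_pkg")

# Both functions are now reachable directly from the package namespace
total = demo_counter_pkg.sum_counters([Counter("aab"), Counter("abc")])
print(total)
```

Because `__init__.py` does the import, users can write `demo_counter_pkg.sum_counters(...)` without knowing which module inside the package defines it.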

### 2.2 requirements.txt 
> It can help your package be more portable by allowing your users to easily recreate its intended environment

E.g. <br/>
requirements = """ <br/>
`matplotlib`>=3.0.0 <br/>
`numpy`==1.15.4 <br/>
`pandas`<=0.22.0 <br/>
`pycodestyle` <br/>
""" <br/>

`pip install -r requirements.txt`
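
As a rough sketch of how the `requirements` string above could be turned into a file, the snippet below writes it to a `requirements.txt` inside a temporary directory (an assumption made here so that no real project is touched) and reads the entries back. In the specifiers, `>=` sets a minimum version, `==` pins an exact version, `<=` sets a maximum version, and a bare name accepts any version.

```python
import tempfile
from pathlib import Path

requirements = """
matplotlib>=3.0.0
numpy==1.15.4
pandas<=0.22.0
pycodestyle
"""

# Write to a temporary directory so the sketch does not touch a real project
req_path = Path(tempfile.mkdtemp()) / "requirements.txt"
req_path.write_text(requirements.strip() + "\n")

# Read back the non-empty lines, one requirement per line
pinned = [line for line in req_path.read_text().splitlines() if line]
print(pinned)
```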

### 2.3 setup.py

**# Import needed function from setuptools**<br/>
`from setuptools import setup`

**# Create proper setup to be used by pip**<br/>
`setup(name='text_analyzer',
      version='0.0.1',
      description='Perform and visualize a text analysis.',
      author='meredith',
      author_email='sss@gmail.com',
      packages=['text_analyzer'],
      install_requires=['numpy'])`

## 3. Classes

In [2]:
# Define Document class
class Document:
    """A class for text analysis
    
    :param text: string of text to be analyzed
    :ivar text: string of text to be analyzed; set by `text` parameter
    """
    # Method to create a new instance of Document
    def __init__(self, text):
        # Store text parameter to the text attribute
        self.text = text

In [4]:
datacamp_tweet = 'Basic linear regression example. #DataCamp #DataScience #Python #sklearn'

# Create an instance of Document with datacamp_tweet
my_document = Document(text=datacamp_tweet)
# Print the text attribute of the Document instance
print(my_document.text)

Basic linear regression example. #DataCamp #DataScience #Python #sklearn


### Non-public methods using `_method`

In [18]:
%%capture
pip install nltk

In [4]:
from collections import Counter
from nltk.tokenize import word_tokenize

class Document:
  def __init__(self, text):
    self.text = text
    # Tokenize the document with non-public tokenize method
    self.tokens = self._tokenize()
    # Perform word count with non-public count_words method
    self.word_counts = self._count_words()

  def _tokenize(self):
    return word_tokenize(self.text)

  # non-public method to tally document's word counts with Counter
  def _count_words(self):
    return Counter(self.tokens)

In [10]:
with open('datacamp_tweets.txt') as f:
    datacamp_tweets = str(f.readlines())

In [11]:
# create a new document instance from datacamp_tweets
datacamp_doc = Document(text=datacamp_tweets)

# print the first 5 tokens from datacamp_doc
print(datacamp_doc.tokens[:5])

# print the top 5 most used words in datacamp_doc
print(datacamp_doc.word_counts.most_common(5))

['[', "'datacamp_tweets", '=', '\\', "'"]
[('@', 891), ('#', 331), ('DataCamp', 305), (':', 291), (',', 271)]
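
The same pattern can be tried without installing `nltk`. The sketch below is a simplified variant, not the class above: it assumes a regular-expression tokenizer as a stand-in for `word_tokenize`, so its tokens will differ from nltk's on punctuation. The leading underscore is only a naming convention that signals "internal use"; Python does not prevent callers from invoking `_tokenize` directly.

```python
import re
from collections import Counter


class Document:
    def __init__(self, text):
        self.text = text
        # Non-public helpers do the work; users only touch the attributes
        self.tokens = self._tokenize()
        self.word_counts = self._count_words()

    def _tokenize(self):
        # Stand-in for nltk's word_tokenize: keep runs of letters only
        return re.findall(r"[a-zA-Z]+", self.text)

    def _count_words(self):
        return Counter(self.tokens)


doc = Document("the rain in spain falls mainly on the plain")
print(doc.tokens[:3])                  # ['the', 'rain', 'in']
print(doc.word_counts.most_common(1))  # [('the', 2)]
```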


### DRY principle
> Don't repeat yourself

With inheritance, we start with a **parent** class and pass on its functionality to a **child** class <br/>
The **child** class inherits all the methods and attributes of its parent, <br/>
and we're able to add additional functionality without affecting the **parent** class

In [25]:
# Define a SocialMedia class that is a child of the `Document class`
class SocialMedia(Document):
    def __init__(self, text):
        Document.__init__(self, text) # need to init the parent class
        self.ten_words = self._ten_words() # child's own attributes
        self.eleven_words = self._eleven_words()
    
    def _ten_words(self):
        return self.tokens[:10]
    
    def _eleven_words(self):
        return self.tokens[11:21]

In [26]:
datacamp_sm = SocialMedia(text=datacamp_tweets)

print(datacamp_sm.ten_words)
print(datacamp_sm.eleven_words)

['[', 'DataCamp', ']', 'Introduction', 'to', 'H2O', 'AutoML', '--', '>', 'In']
['tutorial', ',', 'you', 'will', 'learn', 'about', 'H2O', 'and', 'have', 'a']


In [53]:
dir(datacamp_sm)[-7:-1]

['_ten_words', '_tokenize', 'eleven_words', 'ten_words', 'text', 'tokens']
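
The claim that a child can add functionality without affecting the parent can be checked with a small sketch. The classes here are simplified stand-ins for the ones above: tokens come from `str.split`, and the `hashtags` attribute is a hypothetical addition used only for illustration.

```python
class Document:
    def __init__(self, text):
        self.text = text
        self.tokens = text.split()


class SocialMedia(Document):
    def __init__(self, text):
        # Initialise the parent first so self.tokens exists
        super().__init__(text)
        # Child-only attribute (hypothetical, for illustration)
        self.hashtags = [t for t in self.tokens if t.startswith("#")]


doc = Document("learning #Python")
post = SocialMedia("learning #Python")

print(post.tokens)               # inherited from Document
print(post.hashtags)             # added by SocialMedia
print(hasattr(doc, "hashtags"))  # False: the parent class is unchanged
```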

### Multilevel Inheritance 


In [46]:
class Parent:
    def __init__(self):
        print("I'm a parent")

class Child(Parent):
    def __init__(self):
        Parent.__init__(self) # explicit call on the class: self must be passed
        print("I'm a child") 

class SuperChild(Parent):
    def __init__(self):
        super().__init__() # super() supplies self automatically
        print("I'm a super child!")

class Grandchild(SuperChild):
    def __init__(self):
        super().__init__()
        print('I am a grandchild!')

In [47]:
print(Parent())
print(Child())
print(SuperChild())
print(Grandchild()) # Grandchild is a child of SuperChild

I'm a parent
<__main__.Parent object at 0x124c97e50>
I'm a parent
I'm a child
<__main__.Child object at 0x124c97cd0>
I'm a parent
I'm a super child!
<__main__.SuperChild object at 0x124c97e50>
I'm a parent
I'm a super child!
I am a grandchild!
<__main__.Grandchild object at 0x124c97cd0>
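
The order of the printed messages follows the method resolution order (MRO): each `super().__init__()` call hands control to the next class in the chain before the current class finishes its own work. A minimal sketch, assuming a `calls` list (not part of the classes above) to record the order instead of printing:

```python
class Parent:
    def __init__(self):
        self.calls = ["Parent"]


class SuperChild(Parent):
    def __init__(self):
        super().__init__()
        self.calls.append("SuperChild")


class Grandchild(SuperChild):
    def __init__(self):
        super().__init__()
        self.calls.append("Grandchild")


# The chain Python walks when resolving methods on Grandchild
mro_names = [cls.__name__ for cls in Grandchild.__mro__]
print(mro_names)  # ['Grandchild', 'SuperChild', 'Parent', 'object']

# Initialisers finish from the top of the chain downwards
grandchild = Grandchild()
print(grandchild.calls)  # ['Parent', 'SuperChild', 'Grandchild']

# A grandchild is also an instance of every ancestor
print(isinstance(grandchild, Parent))  # True
```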


## 4. Documentation

def function(x):
    
    """High level description of function 
    
    Additional details on function
    
    :param x: description of parameter x 
    :return: description of return value
    
    >>> function(example)
    example outcome
    """

In [56]:
import re

In [57]:
def tokenize(text, regex=r'[a-zA-Z]+'):
  """
  Split text into tokens using a regular expression

  :param text: text to be tokenized
  :param regex: regular expression used to match tokens using re.findall 
  :return: a list of resulting tokens

  >>> tokenize('the rain in spain')
  ['the', 'rain', 'in', 'spain']
  """
  return re.findall(regex, text, flags=re.IGNORECASE)

In [58]:
tokenize('the rain in spain')

['the', 'rain', 'in', 'spain']
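
A docstring is stored on the function itself, which is how tools such as `help()` display it. The sketch below repeats a trimmed copy of `tokenize` so that it runs on its own, then reads the documentation back with the standard library's `inspect.getdoc`, which also strips the docstring's indentation.

```python
import inspect
import re


def tokenize(text, regex=r"[a-zA-Z]+"):
    """Split text into tokens using a regular expression

    :param text: text to be tokenized
    :param regex: regular expression used to match tokens using re.findall
    :return: a list of resulting tokens
    """
    return re.findall(regex, text)


# The raw docstring lives in __doc__; getdoc returns a cleaned-up copy
cleaned_doc = inspect.getdoc(tokenize)
summary = cleaned_doc.splitlines()[0]
print(summary)
```

Calling `help(tokenize)` in a notebook shows the same text together with the function signature.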

### Unit Testing

* `doctest`
* `pytest`

In [60]:
def square(x):
    """
    Square the number x
    
    :param x: number to square
    :return: x squared
    
    >>> square(3)
    9
    """
    return x ** x  # bug: should be x ** 2; the doctest catches it

import doctest
doctest.testmod()

**********************************************************************
File "__main__", line 8, in __main__.square
Failed example:
    square(3)
Expected:
    9
Got:
    27
**********************************************************************
1 items had failures:
   1 of   1 in __main__.square
***Test Failed*** 1 failures.


TestResults(failed=1, attempted=2)
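
The failure above comes from `x ** x` computing 3 ** 3 = 27 instead of 3 ** 2 = 9. A sketch of the corrected function follows. Instead of `doctest.testmod()`, it runs only the examples in `square`'s own docstring through `DocTestFinder` and `DocTestRunner`, a choice made here so that the result does not depend on other docstrings defined in the same session.

```python
import doctest


def square(x):
    """
    Square the number x

    :param x: number to square
    :return: x squared

    >>> square(3)
    9
    """
    return x ** 2


# Collect and run only the examples found in square's docstring
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(square, name="square", globs={"square": square}):
    runner.run(test)

# summarize() returns a TestResults tuple with failed and attempted counts
results = runner.summarize(verbose=False)
print(results.failed, results.attempted)
```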

### pytest

By default, pytest collects files named `test_*.py` or `*_test.py` and runs the functions in them whose names start with `test`

In [64]:
# # working in workdir/tests/test_document.py
# from text_analyzer import Document

# # Test tokens attribute on Document object
# def test_document_tokens():
#     doc = Document('a e i o u')
    
#     assert doc.tokens == ['a','e','i','o','u']
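
A pytest test is an ordinary function whose name starts with `test` and whose body uses plain `assert` statements. The sketch below shows what a hypothetical `tests/test_document.py` might contain. To stay self-contained it defines a simplified stand-in for `Document` (tokenizing with `str.split`) instead of importing it from `text_analyzer`, and the second test is an assumed addition for illustration.

```python
# Contents of a hypothetical tests/test_document.py


class Document:
    """Simplified stand-in for text_analyzer.Document"""

    def __init__(self, text):
        self.text = text
        self.tokens = text.split()


# Test tokens attribute on Document object
def test_document_tokens():
    doc = Document('a e i o u')
    assert doc.tokens == ['a', 'e', 'i', 'o', 'u']


# Test that the original text is stored unchanged
def test_document_text_is_stored():
    doc = Document('a e i o u')
    assert doc.text == 'a e i o u'


# pytest would discover and call these automatically;
# calling them by hand raises AssertionError on failure
test_document_tokens()
test_document_text_is_stored()
```

Running `pytest` from the working directory would collect this file because its name matches `test_*.py`.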

### Other available tools
**Sphinx**: generate beautiful documentation <br/>
**Travis CI**: Continuously test your code <br/>
**Codecov**: Discover where to improve your project's tests <br/>
**Code Climate**: Analyze your code for improvements in readability <br/>


In [62]:
# # for Sphinx
# from text_analyzer import Document

# class SocialMedia(Document):
#     """Analyze text data from social media
    
#     :param text: social media text to analyze

#     :ivar hashtag_counts: Counter object containing counts of hashtags used in text
#     :ivar mention_counts: Counter object containing counts of @mentions used in text
#     """
#     def __init__(self, text):
#         Document.__init__(self, text)
#         self.hashtag_counts = self._count_hashtags()
#         self.mention_counts = self._count_mentions()