# German Sentiment Model

Resources:
https://sites.google.com/view/germeval2017-absa/data?authuser=0
https://huggingface.co/oliverguhr/german-sentiment-bert

The model uses Google's BERT architecture and was trained on 1.834 million German-language samples. The training data contains texts from various domains, such as Twitter, Facebook, and movie, app, and hotel reviews. You can find more information about the dataset and the training process in the [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.201.pdf).
  
  
  
If you are interested in the code and data used to train this model, please have a look at this repository and our paper. The table below lists the F1 scores this model achieves on the following datasets. Because the model was trained with a newer version of the transformers library, the results are slightly better than those reported in the paper.
  
| Dataset                                                           | F1 score |
|-------------------------------------------------------------------|----------|
| [holidaycheck](https://github.com/oliverguhr/german-sentiment)    | 0.9568   |
| [scare](https://www.romanklinger.de/scare/)                       | 0.9418   |
| [filmstarts](https://github.com/oliverguhr/german-sentiment)      | 0.9021   |
| [PotTs](https://www.aclweb.org/anthology/L16-1181/)               | 0.6780   |
| [germeval](https://sites.google.com/view/germeval2017-absa/home)  | 0.7536   |
| [sb10k](https://www.spinningbytes.com/resources/germansentiment/) | 0.7376   |
| [emotions](https://github.com/oliverguhr/german-sentiment)        | 0.9649   |
| AVERAGE                                                           | 0.85     |
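
The AVERAGE row appears to be the unweighted mean of the seven per-dataset scores, rounded to two decimals. The source does not say how it was computed, so treat that as an assumption. A quick check with the values from the table:

```python
# F1 scores copied from the table above.
f1_scores = {
    "holidaycheck": 0.9568,
    "scare": 0.9418,
    "filmstarts": 0.9021,
    "PotTs": 0.6780,
    "germeval": 0.7536,
    "sb10k": 0.7376,
    "emotions": 0.9649,
}

# Assumption: AVERAGE is the plain (unweighted) mean over datasets.
average_f1 = sum(f1_scores.values()) / len(f1_scores)
print(f"{average_f1:.4f}")
```

Rounded to two decimals, the result matches the 0.85 reported in the table.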

In [11]:
!ls ./raw

german-bert-sentiment.tar.gz unpack.py
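
The archive listed above is presumably a gzip-compressed tarball of a saved model directory, since the loading code later opens it with `mode="r:gz"` and looks for a `.bin` member. The notebook does not show how it was produced, and the `pack_model` helper called in `save_model` is not defined here. The following is only a sketch of what such a helper might look like; its name, signature, and flat archive layout are assumptions.

```python
import os
import tarfile


def pack_model(model_dir: str, model_name: str = "model") -> str:
    """Hypothetical helper: bundle the files in model_dir into <model_name>.tar.gz."""
    archive_path = f"{model_name}.tar.gz"
    with tarfile.open(archive_path, mode="w:gz") as tar:
        for file_name in sorted(os.listdir(model_dir)):
            # arcname drops the directory prefix so members sit at the archive root.
            tar.add(os.path.join(model_dir, file_name), arcname=file_name)
    return archive_path
```

Calling `pack_model('./model', 'german-bert-sentiment')` would write `german-bert-sentiment.tar.gz` to the current working directory.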


In [249]:
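import base64
import io
import json
import os
import re
import tarfile
from typing import List, Union

import boto3
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig

# module-level S3 client, used by Model.load_model_from_s3
s3 = boto3.client('s3')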

class Model():
    def __init__(self,model_path:str,s3_bucket=None,file_prefix=None):
        #load model
        self.model,self.tokenizer = self.from_pretrained(model_path,s3_bucket,file_prefix)
        #helper functions
        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
        self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
        self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)

    def replace_numbers(self,text: str) -> str:
        # replace each digit 0-9 with its German number word (digit by digit, so "12" becomes " eins zwei")
        return text.replace("0"," null").replace("1"," eins").replace("2"," zwei").replace("3"," drei").replace("4"," vier").replace("5"," fünf").replace("6"," sechs").replace("7"," sieben").replace("8"," acht").replace("9"," neun")         

    def clean_text(self,text: str)-> str:    
        text = text.replace("\n", " ")        
        text = self.clean_http_urls.sub('',text)
        text = self.clean_at_mentions.sub('',text)        
        text = self.replace_numbers(text)                
        text = self.clean_chars.sub('', text)                        
        text = ' '.join(text.split()) 
        text = text.strip().lower()
        return text
    
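    def save_model(self,out_path:str,model_name='model'):
        self.model.save_pretrained(out_path)
        self.tokenizer.save_pretrained(out_path)
        # NOTE: pack_model is not defined in this notebook; it has to be provided elsewhere before save_model is called
        pack_model(out_path,model_name)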

    def load_model(self,model_path:str):
        # load the weights from a local directory; fail loudly if they are missing
        if not os.path.isfile(f'{model_path}/pytorch_model.bin'):
            raise FileNotFoundError(f'{model_path}/pytorch_model.bin not found')
        return AutoModelForSequenceClassification.from_pretrained(model_path)
    
    def load_model_from_s3(self,model_path:str,s3_bucket:str,file_prefix:str):
        if model_path and s3_bucket and file_prefix:
            # download the archive into memory and open it without writing to disk
            obj = s3.get_object(Bucket=s3_bucket, Key=file_prefix)
            bytestream = io.BytesIO(obj['Body'].read())
            tar = tarfile.open(fileobj=bytestream, mode="r:gz")
            # config.json is still read from the local model_path
            config = AutoConfig.from_pretrained(f'{model_path}/config.json')
            model = None
            for member in tar.getmembers():
                if member.name.endswith(".bin"):
                    f = tar.extractfile(member)
                    # map_location='cpu' lets weights saved on a GPU load on a CPU-only machine
                    state = torch.load(io.BytesIO(f.read()), map_location='cpu')
                    model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path=None ,state_dict=state, config=config)
            if model is None:
                raise FileNotFoundError(f'No .bin file found in s3://{s3_bucket}/{file_prefix}')
            return model
        else:
            raise KeyError('No S3 Bucket and Key Prefix provided')
    
    def load_tokenizer(self,model_path:str):
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        return tokenizer

    def from_pretrained(self,model_path:str,s3_bucket:str,file_prefix:str):
        if os.path.isfile(f'{model_path}/pytorch_model.bin'):
            model = self.load_model(model_path)
        else:
            model = self.load_model_from_s3(model_path,s3_bucket,file_prefix)
        tokenizer = self.load_tokenizer(model_path)
        return model,tokenizer
    
    def predict_sentiment(self, texts: Union[List[str],str] )-> Union[List[str],str]:
        if isinstance(texts,str):
            texts = [texts]
        texts = [self.clean_text(text) for text in texts]
        # add_special_tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
        input_ids = self.tokenizer.batch_encode_plus(texts,pad_to_max_length=True, add_special_tokens=True)
        input_ids = torch.tensor(input_ids["input_ids"])

        with torch.no_grad():
            logits = self.model(input_ids)
        # debug output: raw logits, one row per input text
        print(logits[0])
        label_ids = torch.argmax(logits[0], dim=1)

        labels = [self.model.config.id2label[label_id] for label_id in label_ids.tolist()]
        # a single text yields a single label instead of a list
        if len(labels) == 1:
            return labels[0]
        return labels
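
To see what the preprocessing in `clean_text` does to an input before it reaches the tokenizer, the same steps can be run outside the class. The sketch below mirrors the regular expressions and the replacement order of the cell above. The module-level names (`CLEAN_CHARS`, `DIGIT_WORDS`, and so on) and the sample sentence are made up for illustration.

```python
import re

# Same patterns as in Model.__init__ (names here are illustrative).
CLEAN_CHARS = re.compile(r'[^A-Za-züöäÖÜÄß ]')
CLEAN_HTTP_URLS = re.compile(r'https*\S+')
CLEAN_AT_MENTIONS = re.compile(r'@\S+')
DIGIT_WORDS = {
    "0": " null", "1": " eins", "2": " zwei", "3": " drei", "4": " vier",
    "5": " fünf", "6": " sechs", "7": " sieben", "8": " acht", "9": " neun",
}


def clean_text(text: str) -> str:
    text = text.replace("\n", " ")
    text = CLEAN_HTTP_URLS.sub('', text)      # drop URLs
    text = CLEAN_AT_MENTIONS.sub('', text)    # drop @mentions
    for digit, word in DIGIT_WORDS.items():   # spell out digits one by one
        text = text.replace(digit, word)
    text = CLEAN_CHARS.sub('', text)          # keep only letters and spaces
    return ' '.join(text.split()).lower()     # collapse whitespace, lowercase


cleaned = clean_text("@Puma Der Kurs stieg um 12% :) https://example.com/news")
print(cleaned)  # der kurs stieg um eins zwei
```

Note that numbers are spelled out digit by digit, so `12` becomes `eins zwei` rather than `zwölf`.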

In [250]:
!rm  ./model/pytorch_model.bin 

rm: ./model/pytorch_model.bin: No such file or directory


In [252]:
model = Model('./model','philschmid-models','sentiment_classifier/german-bert-sentiment.tar.gz')
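
Because `./model/pytorch_model.bin` is absent, this call takes the S3 path, which reads the weights out of a `tar.gz` held entirely in memory. That step can be tried without AWS access. The sketch below builds a tiny stand-in archive and extracts the first `.bin` member from it; the helper name `read_member_from_targz` and the placeholder payload are assumptions for illustration, not part of the notebook.

```python
import io
import tarfile


def read_member_from_targz(archive_bytes: bytes, suffix: str) -> bytes:
    """Return the content of the first regular file whose name ends with suffix."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(suffix):
                return tar.extractfile(member).read()
    raise FileNotFoundError(f"no member ending in {suffix!r}")


# Build a small stand-in archive in memory (placeholder bytes, not real weights).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    payload = b"fake-weights"
    info = tarfile.TarInfo(name="pytorch_model.bin")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

weights = read_member_from_targz(buffer.getvalue(), ".bin")
print(weights)
```

In the notebook, the bytes would come from `s3.get_object(...)['Body'].read()` and would then be passed to `torch.load` wrapped in an `io.BytesIO`.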

In [255]:
model.predict_sentiment('Der Aktienkurs für Puma ist sehr gut.')

tensor([[ 1.4624, -0.8481,  0.0963]])


'positive'
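
The tensor printed above holds raw logits, not probabilities. `argmax` picks index 0, which `id2label` maps to `'positive'` according to the output. The notebook does not show the labels for the other two indices. To get a rough confidence for the prediction, the logits can be passed through a softmax. This sketch uses plain Python so it runs without PyTorch; inside the notebook, `torch.softmax(logits[0], dim=1)` should give the same values.

```python
import math

# Logits copied from the printed tensor above.
logits = [1.4624, -0.8481, 0.0963]


def softmax(values):
    # Subtract the maximum first for numerical stability.
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax(logits)
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)
print([round(p, 3) for p in probabilities], predicted_index)
```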

In [234]:
!transformers-cli env


Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- `transformers` version: 2.10.0
- Platform: macOS-10.15.3-x86_64-i386-64bit
- Python version: 3.8.2
- PyTorch version (GPU?): 1.5.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

