How to detect ENC28J60 module "hang" state #167

Open
zbx-sadman opened this Issue Nov 15, 2016 · 23 comments

zbx-sadman commented Nov 15, 2016

I have been using an ATmega328-based Arduino with an ENC28J60 (revision 6) and Arduino_UIP (fix_errata12 branch) in my hobby project for a long time. My devices handle requests from a monitoring service (Zabbix), but I can't achieve stable operation. A device can work for two weeks on the production (noisy) network and then hang, or work for only 2-3 hours and hang again. It can also work for a month on the home network. I tried changing the power supply, adding capacitors, freeing more RAM (~250 bytes free in the working state) and so on, with no luck in any case: the activity LED blinks, but ethClient.read() returns no data.

Nevertheless, all my attempts to hang the module on purpose failed too ;) I tried flood attacks without success: the module stops answering for a while and works again once the attack is off.

After that I added output of the ENC28J60 register state to my sketch, connected the device to the noisy network and began to wait. When the module froze, I connected to the UART and saw that ECON1=00000000b.

ECON1 contains the RXEN (Receive Enable) bit. If it is 1, packets that pass the current filter configuration are written into the receive buffer. If it is 0, all received packets are ignored.

It seems that the cleared RXEN bit is the reason for this behavior: no data is received while the activity LED keeps blinking.

I did not find any code in Arduino_UIP that clears the RXEN bit. For now I'll try calling Enc28J60.init() when ECON1.RXEN == 0.

I hope this information is helpful, and maybe someone will find a workaround to get more stability out of the ENC28J60.

prezeskk commented Nov 15, 2016

Could you write sample code to detect this situation?
The ENC28J60 in my project hangs too.

zbx-sadman commented Nov 15, 2016

It looks like a dirty hack at this point.

I changed the utility\Enc28J60Network.h file: I moved readReg() from the private to the public members area.

Then I use this piece of code:

 #include <UIPEthernet.h>
 #define NET_ENC28J60_ECON1                                0x1F
 #define NET_ENC28J60_ECON1_RXEN                           0x04
 #define NET_ENC28J60_CHECK_PERIOD                         5000UL  // 5 sec
  ...
 void setup() {
  ...
 }

 void loop() {
   // Time of the last ENC28J60 check; initialized so the first period is measured from here
   uint32_t prevENCCheckTime = millis();
   while (true) {
     uint32_t nowTime = millis();
     if (NET_ENC28J60_CHECK_PERIOD <= (uint32_t) (nowTime - prevENCCheckTime)) {
         // Enc28J60 is the Enc28J60Network class defined in Enc28J60Network.h
         // The ENC28J60 ignores all incoming packets if ECON1.RXEN is not set
         if (!(Enc28J60.readReg((uint8_t) NET_ENC28J60_ECON1) & NET_ENC28J60_ECON1_RXEN)) {
             Serial.println("ENC28J60 reinit");
             Enc28J60.init(netConfig->macAddress);
         }
         prevENCCheckTime = nowTime;
     }
     ...
     various_network_procedures();
     ...
   }
 }

prezeskk commented Nov 15, 2016

Thanks

gwest7 commented Dec 17, 2016

@zbx-sadman did you flood the data to or from the ENC28J60? I can send a lot of data to the module, but when the module sends just a few bytes it hangs.

I also tested the init code above, but it brought no improvement.

gwest7 commented Dec 17, 2016

Are you still using your module as described above, or have you perhaps found a more stable library for the ENC?

zbx-sadman commented Dec 17, 2016

Yes, I flooded the module with TCP and UDP traffic using the hping3 --flood ip-address utility. Pinging (ping ip-address in another terminal session) stops when the flood starts and resumes when the flood is stopped. The fix_errata12 branch is used. I also flooded my sketch with a continuous flow of requests for ~2 days, and there were no hangs.

I've found additional "signs" on a hung ENC module: ESTAT.BUFFER and EIR.RXERIF are set. It seems that sometimes a strange interference just bumps the network chip. Maybe it's just electrostatic interference, I don't know. The IP and MAC are shown in the debug output as they should be, but the ENC module's registers tell me that an error occurred... My ENC hung in about the second week after a restart, so debugging the code is a pain.

I still use Arduino_UIP because it has the same commands as the stock Ethernet library and allows using both Ethernet shields with the same code.

prezeskk commented Dec 17, 2016

Here is a fork of this library with some changes: https://github.com/TMRh20/arduino_uip/tree/fix_errata12

gwest7 commented Dec 17, 2016

@prezeskk are you running the fix_errata12 fork with no lockups? I'm busy stress testing the fork with zbx-sadman's init and will let it run through the night.

zbx-sadman commented Dec 17, 2016

@gwest7, maybe your sketch has a memory leak? In my experiments I found that low memory stops the network functions from working. If only 250 bytes remain free after compiling (according to the IDE report), the device stops replying to ping. Maybe the compiler does not account for memory allocation correctly and some classes from the included libs eat memory on init... But for my sketch it is a fact: no ping reply if less than 250 bytes of memory are free. I think this situation requires deeper source code analysis, but I'm not a great programmer and don't have any debug tools :(

gwest7 commented Dec 18, 2016

Ok, I'll check my sketch again because it did freeze up again during the night. The fix_errata12 fork is definitely more stable. Thank you for the help!

zbx-sadman commented Dec 19, 2016

So, I have read the datasheet and the internet again. The network chip vendor recommends using a TTL level shifter on MISO (I don't use INT) when the ENC28J60 is used with a 5V MCU. As might be expected, this shifter is not installed on my module from Aliexpress (but my device worked...)
[image: enc28j60-appnote]
Also, on the Microchip forum some participants recommend installing additional 100 ohm resistors (depicted as a blue rectangle) on the other SPI bus lines if the ENC is connected to a 5V MCU.

I can't imagine how this protects the RX buffer in my case (and I can't rewire my dev board right now to test it), but maybe it will help you.

gwest7 commented Dec 19, 2016

Mine is connected to an Atmega328 at 5v soldered to a project board. Will make these changes too. Thanks for the effort!

GunterO commented Dec 19, 2016

Maybe I'm suggesting things you've already tried, but my lockups went away when I stopped using the 3.3V power supply of the Arduino board and used a 5V->3.3V step-down converter to power the board instead.
My ENC28J60 board wasn't capable of connecting directly to 5V, and the 3.3V supply of the Arduino is not sufficient (even if it "appears" to work). In that case you don't need to use a TTL level converter for the control lines.

zbx-sadman commented Dec 19, 2016

@GunterO, yes, using 3.3V from the Arduino board is a bad idea )
Nevertheless, my module has an onboard LM1117 LDO regulator, but the RX-error problem exists anyway.

I agree: if the network module and the MCU board are powered from the same supply, the voltage drop on both regulators must be equal or close, and the ENC's TTL level should be enough for the MCU. But who knows how good a voltage regulator from Aliexpress is ;) The ENC can work at 3.1V (according to the datasheet), but that is not enough voltage for a 5V MCU input.

So I think a level converter is worth a try. Maybe it won't help, but everything will be wired as the vendor recommends, and we can rule out this possible hardware error.

zbx-sadman commented Dec 21, 2016

I'm still fighting with my ENC.

Today I wrote a multithreaded app in Perl to test my device further. The script just forks a process that opens a connection to the device, sends a request (~15 bytes) and receives the answer, nothing more. The device just waits ~1 sec before sending the answer and closing the connection.

I monitored the chip's EPKTCNT and EIR registers and some others. I found that the EIR.RXERIF bit gets set with two sessions, while my device stays active and answers requests. The EPKTCNT register shows ~22-23 packets. Microchip's datasheet says the following: When a packet is being received and the receive buffer runs completely out of space, or EPKTCNT is 255 and cannot be incremented, the packet being received will be aborted (permanently lost) and the EIR.RXERIF bit will be set to '1'. Once set, RXERIF can only be cleared by the host controller or by a Reset condition. And it's true: RXERIF does not clear when I stop the Perl script. And the device is still working... I increased UIP_CONF_BUFFER_SIZE to 200, with no luck: everything stayed the same.

So it turns out that this small traffic (a 15-byte request every second in two threads) causes an ENC buffer overflow? But the device still answers and drops no requests. It drives me crazy.

Now I want to test a new strategy: I will monitor the time elapsed since the last successful incoming connection, and if I detect no activity for a minute or so, I will check the ECON1.RXEN and EIR.RXERIF bits. If the first is cleared and the second is set, I just call Enc28J60.init(...). This way the chip is re-initialized only rarely, so the device can be pinged without lost packets.

I know it looks like a mad action, but I see no other way to avoid hard hangs at the moment.
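
The strategy described in this comment can be sketched as a small decision function that runs on a PC as well as on the MCU. This is only an illustration under assumptions: the name shouldReinit and the 60-second limit are made up here (the comment only says "a minute or so"), and the masks are the datasheet bit positions for ECON1.RXEN and EIR.RXERIF. The raw register values would come from readReg() in a real sketch.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint8_t ECON1_RXEN = 0x04;          // receive-enable bit in ECON1
constexpr uint8_t EIR_RXERIF = 0x01;          // receive-error flag in EIR
constexpr uint32_t IDLE_LIMIT_MS = 60000UL;   // assumed value for "a minute or so"

// Hypothetical helper: decide whether the chip should be re-initialized.
// nowMs and lastConnectionMs are millis()-style timestamps,
// econ1 and eir are raw register values read from the ENC28J60.
bool shouldReinit(uint32_t nowMs, uint32_t lastConnectionMs,
                  uint8_t econ1, uint8_t eir) {
    // Unsigned subtraction stays correct across a millis() rollover.
    const uint32_t idleMs = nowMs - lastConnectionMs;
    if (idleMs < IDLE_LIMIT_MS) {
        return false;  // recent traffic: leave the chip alone
    }
    const bool rxDisabled = (econ1 & ECON1_RXEN) == 0;
    const bool rxError = (eir & EIR_RXERIF) != 0;
    return rxDisabled && rxError;  // both signs of a hang must be present
}
```

Keeping the decision separate from the register access makes it easy to try different conditions without waiting weeks for a real hang.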

gwest7 commented Dec 22, 2016

Good luck. Looking forward to see your results.

zbx-sadman commented Dec 23, 2016

@gwest7 Can you email me or give me your address? I want to check a guess, and your "fast hanging" device would be helpful.

P.S. My email is on my profile page.

Thanks

zbx-sadman commented Jan 28, 2017

Hello again.

I'm continuing my long-term testing of devices using the ENC28J60 and see that at the moment they are working satisfactorily.

Now I think that my problem with the frequent RXERIF rising may be due to the duration of the subroutines called from the main loop. My project initially worked with a Wiznet module and library, and I never worried about the state of the network buffer because it is handled by hardware. It turned out that with UIPEthernet I must run the internal tick() subroutine as often as I can to avoid RX errors. With a short loop() it works seamlessly: tick() is called by various procedures of the library classes, like server.available(), Ethernet.maintain() etc. But my code calls subroutines that sometimes run for about a second (maybe 0.5 to 1.5 s). You can just use delay(1000) in loop() and get RXERIF rising.

Yes, I know, it seems like I made a silly error.
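
One way to keep the stack polled during a long wait is to replace a plain delay with a loop that services the network on every pass. The sketch below is a hedged illustration, not code from UIPEthernet: waitWhileServicing is a made-up name, and the clock and the service call are passed in as callables so the function can be tried on a PC with a fake clock.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: wait for durationMs while calling service() on every
// pass, so the network stack keeps being polled during the wait.
// now() must return a millis()-style 32-bit millisecond counter.
template <typename NowFn, typename ServiceFn>
uint32_t waitWhileServicing(uint32_t durationMs, NowFn now, ServiceFn service) {
    const uint32_t start = now();
    uint32_t calls = 0;
    // Unsigned subtraction keeps the comparison valid across a counter rollover.
    while (static_cast<uint32_t>(now() - start) < durationMs) {
        service();
        ++calls;
    }
    return calls;  // number of service() calls made during the wait
}
```

On an Arduino, and assuming the toolchain accepts C++11 lambdas, a call such as `waitWhileServicing(1000, [] { return millis(); }, [] { Ethernet.maintain(); });` would stand in for `delay(1000)`.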

So, when I call Ethernet.maintain() more often, RX errors become rare. But I still get them, because my subroutines need time to do their work. I found no RX-error handling in the UIPEthernet source code (though not every RXERIF rise causes a "hang" state of the ENC), so I wrote my own piece of code to read the ENC RX-error state (TRUE == ESTAT_BUFFER & EIR_RXERIF) and the receive-enable setting (ECON1_RXEN). My code calls Enc28J60.init(mac) if it detects the chip error state with no incoming connection for a while.

Let me illustrate:

My "home" ENC device, which does everything quickly (reads DS18B20, DHT22, BMP180 sensors and so on):
[image]

My "work" ENC device, which does everything not so quickly (reads a DS18B20 and shows a chart on an 8x8 LED screen):
[image]

The "work" device is placed in a "dirty" network environment (many broadcasts, TCP connections to the device), while the "home" device just relaxes in a small network.

As you can see, the "home" device has been working for >37 days with only 1 re-init, and the "work" device has >28 days of uptime and has made 31 re-inits of the ENC28J60. I see no lost sensor metrics on either device, so detection of the "hang" state is timely.

I hope this information will be helpful.

P.S. My ENC-based device answers a simple request more quickly than my Wiznet-based device (0m0.026s vs 0m0.211s). Strange, but it's a fact.

GunterO commented Jan 30, 2017

@zbx-sadman Thanks a lot for your work on this. Very interesting!
Do you have a final code snippet of your "hang fix" (via Enc28J60.init(mac))?
Thanks!

zbx-sadman commented Jan 30, 2017

@GunterO, I have a lot of extra code in my project, but I think this piece of code shows how to detect the "hang" state and re-init the ENC module.

I'm also trying to find a simpler way to clear the RX-error state, something less radical than init()...

 #include <UIPEthernet.h>
 #define NET_ENC28J60_EIR                                 0x1C
 #define NET_ENC28J60_ESTAT                               0x1D
 #define NET_ENC28J60_ECON1                               0x1F
 #define NET_ENC28J60_EIR_RXERIF                          0x01
 #define NET_ENC28J60_ESTAT_BUFFER                        0x40
 #define NET_ENC28J60_ECON1_RXEN                          0x04

 #define NET_ENC28J60_CHECK_PERIOD                        5000UL  // 5 sec
  ...
 void setup() {
  ...
 }

 void loop() {
   uint32_t prevENCCheckTime, cntReinits;

   prevENCCheckTime = millis();
   cntReinits = 0;

   while (true) {
     uint32_t nowTime = millis();
     // Ethernet.maintain() calls the internal tick() procedure that UIPEthernet needs to process the ENC28J60 buffers.
     // Call it here if you are not sure that other UIPEthernet subroutines call tick() periodically.
     Ethernet.maintain();

     if (NET_ENC28J60_CHECK_PERIOD <= (uint32_t) (nowTime - prevENCCheckTime)) {
        // Enc28J60 is the Enc28J60Network class defined in Enc28J60Network.h
        // readReg() must be moved from the private to the public members area in utility\Enc28J60Network.h
        // The ENC28J60 ignores all incoming packets if ECON1.RXEN is not set
        uint8_t stateEconRxen = Enc28J60.readReg((uint8_t) NET_ENC28J60_ECON1) & NET_ENC28J60_ECON1_RXEN;
        // ESTAT.BUFFER is raised on a TX or RX error
        // I think testing this register is not necessary: checking EIR.RXERIF may be enough
        uint8_t stateEstatBuffer = Enc28J60.readReg((uint8_t) NET_ENC28J60_ESTAT) & NET_ENC28J60_ESTAT_BUFFER;
        // EIR.RXERIF is set on an RX error
        uint8_t stateEirRxerif = Enc28J60.readReg((uint8_t) NET_ENC28J60_EIR) & NET_ENC28J60_EIR_RXERIF;
        if (!stateEconRxen || (stateEstatBuffer && stateEirRxerif)) {
           Serial.println("ENC28J60 reinit");
           Enc28J60.init(netConfig->macAddress);
           cntReinits++;
        }
        // Check done (with or without re-init). Skip module testing for the next NET_ENC28J60_CHECK_PERIOD
        prevENCCheckTime = millis();
     }
     // Check EthernetServer for a new incoming connection, fire up the status LED and do other system actions
     ...
     if (!ethClient.available()) { continue; }
     incomingData = ethClient.read();
     // Data receiving finished
     if ('\n' == incomingData) {
        // Call the subroutines that obtain data from sensors, send data to the remote host, etc.
        ...
        ethClient.stop();
        // The connection was successful: do not check the ENC28J60 for the next NET_ENC28J60_CHECK_PERIOD
        prevENCCheckTime = millis();
     }
   }
 }
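
The hang condition in the sketch above can also be pulled out into a pure function, which makes it possible to check the logic on a PC against known register values. This is a minimal sketch, assuming the same masks as the defines above. The name needsReinit is hypothetical and not part of UIPEthernet.

```cpp
#include <cassert>
#include <cstdint>

// Bit masks, same values as the defines in the sketch above.
constexpr uint8_t EIR_RXERIF   = 0x01;
constexpr uint8_t ESTAT_BUFFER = 0x40;
constexpr uint8_t ECON1_RXEN   = 0x04;

// Hypothetical helper mirroring the condition
// !stateEconRxen || (stateEstatBuffer && stateEirRxerif)
// on raw register values.
constexpr bool needsReinit(uint8_t econ1, uint8_t estat, uint8_t eir) {
    return (econ1 & ECON1_RXEN) == 0 ||
           ((estat & ESTAT_BUFFER) != 0 && (eir & EIR_RXERIF) != 0);
}
```

With this condition a cleared RXEN alone triggers a re-init, while RXERIF alone (without ESTAT.BUFFER) does not, which matches the observation that not every RXERIF rise means a hang.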

GunterO commented Jan 30, 2017

Great, thanks!

zbx-sadman commented Jan 31, 2017

I forgot to add the Ethernet.maintain() call to my example code. It is fixed now.

One more interesting fact: adding Ethernet.maintain() to the network loop changes free program storage space only slightly (-40 bytes for my test code) when UIPEthernet is used, but it eats MCU flash (~3.3 kB) and RAM (~40 bytes) when I simply switch the network driver to Wiznet's Ethernet.h. The Wiznet driver then pulls DHCP functionality into the firmware even if you don't want to use it.

So you have to guard the call like this:

....
#if defined(FEATURE_NET_DHCP_ENABLE) || defined (TRANSPORT_ETH_ENC28J60)
    result = Ethernet.maintain();
#endif
...

zbx-sadman commented Apr 13, 2017

I want to report my experiment results: the ENC28J60-based device has rolled over the maximum millis() period twice without hanging. But I need to update the firmware, so I have to stop this run.

...and I really do not understand why the number of re-inits increased dramatically around 24.03. But the device still works, serves requests without rest, and has not lost any sensor data at any time.

[image: zabbuino_enc28j60_stat]
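
The periodic check survives a millis() rollover because the elapsed time is computed with unsigned 32-bit subtraction, as in the `(uint32_t) (nowTime - prevENCCheckTime)` expression used in the sketches earlier in this thread. A small host-side illustration, with the hypothetical name periodElapsed:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true once periodMs has elapsed since previousMs.
// Both timestamps are 32-bit millisecond counters like Arduino's millis().
constexpr bool periodElapsed(uint32_t nowMs, uint32_t previousMs, uint32_t periodMs) {
    // Modulo-2^32 subtraction gives the true elapsed time even if the
    // counter wrapped between the two readings.
    return static_cast<uint32_t>(nowMs - previousMs) >= periodMs;
}

// Days until a 32-bit millisecond counter wraps: 2^32 ms is roughly 49.7 days.
constexpr double ROLLOVER_DAYS = 4294967296.0 / (1000.0 * 60.0 * 60.0 * 24.0);
```

Comparing absolute timestamps instead (for example `nowMs >= previousMs + periodMs`) would misbehave around the wrap, which is why the subtraction form is the safer choice for devices that run for months.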